Chapter 5: Basic Linear Regression


1. Why Regression Analysis Has Dominated Econometrics

Up to now we have focused on forming estimates and tests for fairly simple cases involving only one variable at a time. But the core task of the human sciences is to study the simultaneous interrelationships among several variables. How will an increase in price affect quantity demanded, how will law enforcement affect deviant behavior, how will a change in the Federal deficit affect inflation? These are all questions about the effect of one variable upon another. The only tool we've discussed for such questions is correlation, and we've seen that it has serious drawbacks.

If we were doing an experimental natural science, we might solve this problem by conducting controlled experiments, in which we only change one variable at a time. This might isolate the relationships among variables without needing too much statistical artillery. But usually this isn't an option in the human sciences. So we must develop a statistical "virtual laboratory," a means of coordinating our data so that we can draw conclusions as if the data had been generated by a controlled experiment, teasing out the unique effect of each variable. And regression analysis is the most common approach used by economists to construct such a virtual laboratory to explore simultaneous relationships among several variables.

We'll start with the simplest case: one variable affecting one other variable. In the next chapter we consider the more realistic case, where several independent variables affect a dependent variable.

2. The Basic Regression Paradigm

Let's say that I lead you out to the western sideline of the college soccer field tomorrow morning at 9:00 a.m. with the following instructions:

Last evening at 10:00 p.m. I buried a tiny 1cm canister under 1/2 cm of soil, somewhere along the opposite edge of this field. The canister contains the entire college endowment--$5,,. It's yours if you walk directly to it and pick it up on your first try.
As I buried the money, I marked the path from this side of the field to the canister with a narrow piece of double-sided tape. Unfortunately, this tape completely disintegrates 1 hour after use. But, fortunately for you, the tape was covered with the eggs of a large Amazonian flea. These fleas hatch, take one leap, and die. Your mission, therefore, is to reconstruct the position of the tape, using the flea carcasses as your guide.

Why would I do this? OK, I'm a little eccentric sometimes. But this exercise would exactly parallel the duties of regression analysis: We believe, for theoretical reasons, that there is some relationship in the world between two variables, say price and quantity demanded at a Farmers Market booth on a Saturday morning:

[Figure: population regression line relating quantity demanded to price]

(Notice that I've put the dependent variable on the vertical axis. Looks like tape on a field, no?)

Unfortunately, the intercept and slope of the line that measures this relationship are population parameters; we cannot directly observe them. Instead we observe samples from this population, and our samples are unlikely to lie exactly on top of the population regression line:

[Figure: sample observations scattered around the population regression line]

(This kind of picture is called a "scatter plot" or "sample scatter diagram.") There will be random errors that move our observations away from the regression line, because some things we've left out of our model also affect quantity demanded (like changes in income), the population relationship may not be perfectly linear (as we're assuming here), we may make some errors in measuring the two variables, and there may simply be a truly random, nondeterministic component to demand. In any event, the thing we observe is a cloud of measurements around the population regression line, not the line itself. From this cloud of data, our job is to make the best guess about where the invisible population line actually lies.

How to proceed? The majority of econometric work approaches this in typical economist fashion: Let's reduce this problem to a simpler one by making several simplifying assumptions. Then in the succeeding chapters we will learn how to discover whether each assumption is actually justified, identify the problems that arise when each assumption is violated, and try to construct a fix for these problems. It will be a bit like practicing medicine: Whenever you do regression analysis, you'll fall into the pattern of diagnosing violations of assumptions, recognizing the symptoms of these violations, prescribing a treatment, and monitoring the results.

That's the most common approach to doing econometrics. We should mention that there are several less-common approaches to the basic problem of inferring population relationships from sample observations: non-parametric estimation, vector-autoregressive models, and others. These are usually considered beyond the scope of an undergraduate course (and even many graduate courses) but might make good term-paper topics for the right person.

3. The Classical Regression Assumptions

If we were out on the soccer field trying to reconstruct the position of the double-sided tape, we'd have to make some assumptions about how the tape was placed and how Amazonian fleas jump when they hatch. The same is true of reconstructing a population regression line from our sample observations. Eight basic assumptions have proven useful:

Assumption 1: Assume that the relationship between the variables is linear.

We'll find easy ways to ease this assumption, but it's a helpful place to start. In symbols, we're assuming that the population relationship looks like this:

$$Y = \alpha + \beta X + \varepsilon \qquad (5.1.2)$$

That is, the dependent variable is a linear function of an independent variable, plus-or-minus a random error that we are calling $\varepsilon$ (the Greek lower-case epsilon). $\alpha$ and $\beta$ are known as the regression coefficients. $\beta$ represents the marginal effect of X upon Y, the slope of the regression line. $\alpha$ represents the intercept of the line, which incorporates the effect upon Y of all the variables that influence Y but do not appear in our equation. (For a demand curve, these would include income and population size.) If any of these absent variables were to change, $\alpha$ would also change; the entire line would shift.

To be a bit picky, Equation 5.1.2 is a reference to the entire population relationship, the whole line. We'd denote any particular occurrence of the dependent variable along this line with lower-case letters bearing subscripts,

$$y_i = \alpha + \beta x_i + \varepsilon_i,$$

where the lower-case subscript $i$ indicates that this particular occurrence of Y is one of the (upper-case) N occurrences we've measured.

[Figure: an individual observation relative to the population regression line]

Remember: It's usually lower-case letters for references to individual observations, upper-case letters for references to totalities.

Of course we never actually observe the $\alpha$ or the $\beta$ because they're population parameters, so we can never actually know the population line's position or measure the $\varepsilon_i$ (the distance of our observation from the population regression line).
Instead we will eventually make estimates of these items, constructing a sample regression line that we'll denote

$$Y = \hat\alpha + \hat\beta X + \hat\varepsilon$$

So you might say that the basic regression problem is to calculate estimates $\hat\alpha$, $\hat\beta$ and $\hat\varepsilon$ for the population parameters $\alpha$, $\beta$ and $\varepsilon$. We'll also want to calculate the standard errors of these three estimators, so we can conduct hypothesis tests and build confidence intervals.
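To make the notation concrete, here is a minimal Python sketch of the population relationship $y_i = \alpha + \beta x_i + \varepsilon_i$. The parameter values (an intercept of 400, a slope of -5, an error standard deviation of 30), the price grid, and the function name `simulate_sample` are all hypothetical choices made for illustration, not values taken from this chapter. In real work the population parameters are never observed; the simulation simply lets us see the errors that are normally invisible.

```python
import random


def simulate_sample(alpha, beta, sigma, xs, seed=0):
    """Draw one sample from y_i = alpha + beta * x_i + eps_i, eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [(x, alpha + beta * x + rng.gauss(0.0, sigma)) for x in xs]


# Hypothetical population parameters (invisible to the analyst in practice).
ALPHA, BETA, SIGMA = 400.0, -5.0, 30.0
prices = [float(p) for p in range(1, 21)]  # x values are fixed, not random

sample = simulate_sample(ALPHA, BETA, SIGMA, prices)

# Each error is the vertical distance of an observation from the population line.
errors = [y - (ALPHA + BETA * x) for x, y in sample]

for (x, y), eps in list(zip(sample, errors))[:3]:
    print(f"x = {x:5.1f}   y = {y:7.2f}   error = {eps:7.2f}")
```

Setting `sigma` to zero puts every observation exactly on the population line, which is the situation we never get to see.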

Now that we've explained the first major assumption and clarified notation, the remaining seven assumptions will go more quickly.

Assumption 2: Assume that the error term is a random variable with a mean of zero:

$$E(\varepsilon_i) = 0$$

This assumes that, on average, the error terms cancel each other out: some positive (placing our observations above the true regression line), some negative, in roughly equal numbers and distances. In other words, the expected value of y depends upon the value of x, and this "conditional mean" of y is equal to $\alpha + \beta x$, given a value of x. On the soccer field, this is like assuming that the fleas jump, in roughly equal numbers, toward either goal post. If you assumed this but they were instead magnetized and all jumped to the north, your estimate of the population line would be mistaken. You'd dig your hole a bit north of the treasure.

Assumption 3: Assume that X is not a random variable; it is measured without error.

Our sample observations don't lie precisely on top of the population regression line because of random errors, and now we make another limiting assumption about those errors: They all pertain to the measurement of Y only, not the measurement of X. In our graphs, this means that Price is measured with certainty, but the response of Quantity to Price is somewhat random. In effect, we assume that the sample data have jumped vertically away from the regression line before we could measure them. If this is true, it simplifies the process of finding the best estimate of the regression line: Just find a line that minimizes the vertical distances between your data and your line.

This assumption also has a pleasant implication about the covariance between the error term and the independent variable:

$$\mathrm{Cov}(x_i, \varepsilon_i) = E(x_i \varepsilon_i) - E(x_i)E(\varepsilon_i) = x_i E(\varepsilon_i) - x_i E(\varepsilon_i) = 0 \qquad (5.3.2)$$

In words: The covariance between our independent variable and the error term is zero; they are not correlated. Equation 5.3.2 will lead to some nice properties for the OLS estimators of the regression line.
Maybe you can picture them now: Imagine how hard it would be to find the treasure if the direction of the fleas' jumps (the error term) were correlated with the independent variable (the distance you've walked across the field searching for treasure). If there were, say, a positive correlation, then the closer you get to the treasure (that's increasing the independent variable in our picture), the more the fleas tend to jump above the true regression line, not below it; they would lead you to the left of where the treasure lies, and you'd end up with a biased estimate of the true regression line.

By the way, this assumption isn't quite the same thing as assuming that changes in X cause changes in Y, though it's close. Strictly speaking, regression analysis only indicates an association between two variables, not a cause-and-effect relationship between them. But, though there are formal tests of causality between two variables, the assumption that only one variable experiences random error nearly forces us to think of that variable (Y) as somehow responding to changes in the other variable (X). The next chapter will consider cases in which more than one independent variable affects a single dependent variable (Y). Later in the course we will encounter simultaneous equations models, which allow for more than one dependent variable.

Assumption 4: Assume that all random errors are identically distributed (constant variance):

$$\mathrm{Var}(\varepsilon_i) = \sigma^2 \text{ for all } i$$

This property is called homoscedasticity, meaning "equal scatter." If instead, for example, the variance of the error term increased as X increased, our data would look something like this:

[Figure: scatter plot in which the observations fan out from the regression line as X increases]
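A small simulation can stand in for that picture. The sketch below is an illustration under assumed values: the line $y = 10 + 2x$, an error standard deviation proportional to $x$, and the helper name `simulate_fanning_errors` are all hypothetical. It draws errors whose spread grows with $x$, then compares the error variance in the lower and upper halves of the $x$ range.

```python
import random
import statistics


def simulate_fanning_errors(xs, scale=0.5, seed=1):
    """Return errors whose standard deviation is scale * x (unequal scatter)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale * x) for x in xs]


xs = [float(x) for x in range(1, 201)]                   # hypothetical x values
errors = simulate_fanning_errors(xs)
ys = [10.0 + 2.0 * x + e for x, e in zip(xs, errors)]    # hypothetical line plus error

half = len(xs) // 2
var_low = statistics.pvariance(errors[:half])     # errors observed at small x
var_high = statistics.pvariance(errors[half:])    # errors observed at large x

print(f"error variance at low x:  {var_low:10.1f}")
print(f"error variance at high x: {var_high:10.1f}")
```

Under homoscedasticity the two variances would be roughly equal; here the second is expected to be several times the first.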

Assumption 5: Assume that all random errors are independently distributed:

$$\mathrm{Cov}(\varepsilon_i, \varepsilon_s) = 0 \text{ for all } i \neq s$$

This property is called serial independence, meaning that the distribution of any one error term doesn't depend on the random errors elsewhere in the series of data. This assumption is most likely to be violated in time-series data, where one observation may be influenced by preceding observations. Consider national GDP data: If one quarter's GDP is well below potential GDP, it's likely that the next quarter will also be below potential.

Taken together, Assumptions 2, 4 and 5 imply that the random errors are identically and independently distributed (i.i.d.) random variables with a mean of zero and fixed, finite variance. One more assumption about the error terms will prove helpful:

Assumption 6: Assume that the random errors are normally distributed.

This assumption will allow us to get started in testing hypotheses and forming confidence intervals. Combining these first six assumptions, we can summarize the basic linear regression model:

$$Y = \alpha + \beta X + \varepsilon, \quad \text{where } \varepsilon_i \sim N(0, \sigma^2), \quad \mathrm{Cov}(x_i, \varepsilon_i) = 0 \text{ for all } i, \quad \mathrm{Cov}(\varepsilon_i, \varepsilon_s) = 0 \text{ for all } i \neq s.$$

Folks sometimes summarize these assumptions by saying that they've assumed that the random error term is "well behaved."

Those are all the major assumptions of the model per se, but we must make two additional assumptions about the data we insert into this model:

Assumption 7: Not all observed values of x may be identical.

It would be hard to figure out how price influences quantity demanded if the price never changed. The data would look like this, and any number of straight lines would fit those observations equally well:

[Figure: observations stacked vertically at a single value of x]

Hence it would be impossible to identify the best estimates of $\alpha$ and $\beta$. Stated differently, this assumption implies that the sample variance of the independent variable is not zero:

$$\frac{\sum (x_i - \bar{x})^2}{N - 1} \neq 0 \qquad (5.7.2)$$

Assumption 8: The number of model parameters must not be larger than the number of observations (N).

In the simple linear case, we have two model parameters to estimate: $\alpha$ and $\beta$. We must have at least two sample observations in order to estimate them. Two points determine a line, and any more points are gravy. But try fitting a line to a single point. Any line through that point will work, so it again becomes impossible to identify the best estimates of $\alpha$ and $\beta$.

There you have the eight classical assumptions of simple regression analysis. It's time to think about how to actually calculate estimates of $\alpha$ and $\beta$ under these assumptions.

4. Estimating the Regression Population Parameters: Ordinary Least Squares

As was suggested while discussing Assumption 3 (5.3.1), we could find parameter estimates by searching for the line that minimizes the vertical distances between this line and the data. OLS estimation is the most common way of doing this, in which we minimize the squared vertical distances between the data and the estimated regression line. Why square the distances? This converts all errors to positive numbers (which makes the computations more straightforward), and amplifies the influence of outliers, observations that lie farther from the core of our observations. (You might think it's not a good idea to exaggerate these least-typical observations, and we will eventually encounter other ways to estimate the regression parameters.) It also just happens that the OLS estimators have some desirable properties, which we explore in Section 5 of this chapter.

Let's derive the OLS estimators: Our basic linear model holds that each data point can be described (from 5.1.4) as

$$y_i = \hat\alpha + \hat\beta x_i + \hat\varepsilon_i.$$

Since we want to minimize something involving the errors, let's solve this equation for the estimated error (or "residual") term:

$$\hat\varepsilon_i = y_i - (\hat\alpha + \hat\beta x_i). \qquad (5.9)$$

Now square this residual, then sum across all observations, giving us the sum of squared errors (also called error sum of squares, abbreviated ESS), which depends upon the numbers we select as estimators of $\alpha$ and $\beta$. (If you choose a bad $\hat\alpha$ and $\hat\beta$, you'll have bigger residuals.) In symbols:

$$ESS(\hat\alpha, \hat\beta) = \sum [y_i - (\hat\alpha + \hat\beta x_i)]^2. \qquad (5.10)$$

Now just choose an $\hat\alpha$ and $\hat\beta$ that will minimize this sum:

$$\min_{\hat\alpha,\, \hat\beta} \sum [y_i - (\hat\alpha + \hat\beta x_i)]^2$$

How would we minimize this? By taking first derivatives with respect to $\hat\alpha$ and $\hat\beta$, setting these derivatives equal to zero, and solving for the optimal estimators $\hat\alpha$ and $\hat\beta$. (The derivation is rather tedious, 9 pages in the textbook I like best, but only requires the first and seventh of our eight assumptions. The other assumptions are required to assure some desirable properties for these OLS estimators.) The first derivatives yield so-called "normal equations," which can be solved for $\hat\alpha$ and $\hat\beta$ to yield:

$$\hat\beta = \frac{s_{x,y}}{s_x^2}, \qquad \hat\alpha = \bar{y} - \hat\beta \bar{x}$$

In prose: The OLS estimator of the slope equals the sample covariance between x and y, divided by the sample variance of x. (Isn't that great?! The simple covariance between x and y had serious drawbacks in studying the relationship between two variables, but if we just divide it by the variance of x and allow for an intercept, many of those drawbacks disappear! And we'll see that regression analysis opens up many other avenues of learning that are missed by correlation and covariance. Better Living Through Calculus!) The OLS estimator of the intercept is derived from the estimate of the slope: find the slope first, then find the only intercept that allows a line with that slope to pass through both the average value of y and the average value of x.
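The two estimator formulas translate almost line for line into code. The following Python sketch is one way to do it, assuming plain lists of numbers; the function name `ols_fit` and the price and quantity data are hypothetical and chosen only for illustration. It computes the slope as the sample covariance over the sample variance, then the intercept that makes the line pass through the point of means.

```python
def ols_fit(xs, ys):
    """Return (alpha_hat, beta_hat) for the line y = alpha_hat + beta_hat * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is not identified")
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar  # line passes through (x_bar, y_bar)
    return alpha_hat, beta_hat


# Hypothetical data: prices and quantities demanded.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
quantities = [9.1, 7.2, 4.8, 3.1, 0.9]

alpha_hat, beta_hat = ols_fit(prices, quantities)
residuals = [q - (alpha_hat + beta_hat * p) for p, q in zip(prices, quantities)]
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
```

The two `ValueError` checks correspond to the two data assumptions (at least as many observations as parameters, and some variation in x). The residuals from the fitted line sum to zero and are uncorrelated with x in the sample, which is what the normal equations require.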
Long, long ago (1978), in a galaxy far away, young college students like your instructor derived and calculated such estimators by hand. Using slide-rules to square and sum deviations and cross-products, redefining variable units to ease the calculations, keypunching programs on eighty-space IBM cards (one card per line of program), standing in long lines for printed output from screen-less mainframe computers... Those were the days. Now STATA will do this for enormous multi-dimensional data sets, and have the results on your screen before you can glance up. For that reason, we don't do many tedious problems involving the normal equations any more.

While this is generally a good thing, it sometimes leads to The Bubba Effect. It's so easy to do regressions that people become thoughtless, committing what you might call "Type III error": the use of a good model in the wrong situation. When analysis was difficult, there were fewer tools at our disposal, but people didn't use them casually. The silver lining is that, freed from some tedium, we can spend more class time developing your clinical instincts about the actual practice of applied econometrics. That should help minimize the probability of Type III error. This is why you are learning to write original programs in a statistical language rather than just pointing-and-clicking, why you do a major research project for this class, why we sometimes digress into the philosophy of science surrounding econometrics, and why we spend a good deal of group time studying cases of good and bad econometrics. These are all ways of making you a more thoughtful econometrician.

By the way, machines are human too, and when squaring and summing lots of big numbers they tend to make rounding errors. It's therefore a good idea to measure your variables in large units where possible, to keep your numbers small. For example, if GDP is your independent variable, measure it in trillions of dollars rather than in dollars, so that the computer will be working with smaller squared numbers. Just be sure to remember that the resulting slope coefficient measures the marginal effect of a one-trillion-dollar change in GDP, not a one-dollar change.

Having discovered how to compute OLS estimators of the regression parameters, let's move on to the properties of these estimators.

5. Properties of the OLS Estimators

Chapter 4, Section 3 outlined some desirable small- and large-sample properties for estimators. We present, without proof, a scorecard of some of the OLS estimators' attributes:

Unbiased and Consistent (5.14): Assumptions 2 and 3 assure that $\hat\alpha$ and $\hat\beta$ are unbiased. Assumptions 3 and 7 assure that they are consistent (as long as the sample variance of the independent variable is not infinitely large, which is unlikely).

Efficiency: BLUE estimators (5.15): Given Assumptions 2, 3, 4, 5 and 7, the OLS estimators are the Best (most efficient) Linear Unbiased Estimators (they are "BLUE"). Among the many linear combinations of the data that form unbiased estimators, the OLS estimators are the most efficient. The proof of this property is known as the Gauss-Markov Theorem.

You should be aware that, when some of the assumptions do not hold, the OLS estimators become inferior to other options.
The most common other option would involve so-called maximum likelihood estimation. With maximum likelihood estimation, we completely scrap the idea of minimizing squared deviations between data and estimated line. Instead we specify the probability distribution (or "likelihood function") of the error term (we've assumed it's a normal distribution, but it could be something else). This likelihood function will depend in part on the values of $\hat\alpha$ and $\hat\beta$. We then choose $\hat\alpha$ and $\hat\beta$ that will maximize the likelihood that we would have observed the data that were collected. You will be relieved to know that

MLE estimators (5.16): If Assumptions 2 through 8 are met, the OLS estimators are identical to the maximum likelihood estimators.

There are three more topics we should consider about the simple linear regression model, and they all concern practical matters of measuring the precision of the OLS estimates. For example, it's cold comfort to know the OLS estimators are the most efficient available if their variances are still very large. The three sections that round out this chapter discuss the precision of the estimators, hypothesis testing, and forecasting.

6. Estimator Variances, Covariances, and Goodness of Fit

The variance and standard error of an estimator are indexes of its reliability and precision. Estimators like $\hat\alpha$ and $\hat\beta$ are random variables, of course, because they are linear combinations of the estimated error terms, the $\hat\varepsilon_i$ terms, which are normally-distributed random variables. Thus the variances of $\hat\alpha$ and $\hat\beta$ might be expected to depend upon the variance of the random error term (which we'll simply call $\sigma^2$ rather than $\sigma_\varepsilon^2$ whenever possible). In particular, it can be shown that

$$\mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \qquad (5.17)$$

and

$$\mathrm{Var}(\hat\alpha) = \frac{\left(\sum x_i^2\right) \sigma^2}{N \sum (x_i - \bar{x})^2}$$

In words:

The estimator of the regression slope, $\hat\beta$, becomes more precise (has smaller variance) as 1) we sample a wider variety of x values (which makes the denominator grow), or 2) the error term's variance (in the numerator) is smaller, packing our data more tightly around the regression line.

The estimator of the regression intercept, $\hat\alpha$, becomes more precise as 1) we sample a wider variety of x values, 2) the error term's variance shrinks, 3) our observations are nearer the vertical axis, where the intercept is (which shrinks the first term in the numerator), or 4) our sample size (N) increases. (Of course, a larger sample size will indirectly make the $\hat\beta$ estimator more reliable, too, by increasing the sum of squared deviations of x. But there's an additional direct effect of sample size upon the precision of $\hat\alpha$.)

Our estimates of $\alpha$ and $\beta$ are not independent of each other, as they're computed from the same data sample. Thus they normally have a covariance. It can be shown that

$$\mathrm{Cov}(\hat\alpha, \hat\beta) = \frac{-\bar{x}\, \sigma^2}{\sum (x_i - \bar{x})^2}$$

In words: $\hat\alpha$ and $\hat\beta$ become less interdependent as 1) we sample a wider variety of x values (swelling the denominator), 2) the error term's variance shrinks, or 3) our observations are nearer the vertical axis, shrinking $\bar{x}$. The covariance between $\hat\alpha$ and $\hat\beta$ 4) has a sign opposite to the average value of x (since the expression begins with a negative sign, and everything in the expression except $\bar{x}$ is always positive). The covariance is negative, as long as the independent variable is on-average positive.

That all makes sense if you picture a line being fit to a cluster of data. If you were to push down on the intercept while trying to make the line fit well, the slope would have to increase: a negative covariance between slope and intercept (Item 4).
And this bobbing of the regression line would be less pronounced if the data are clustered near the vertical axis (Item 3) and tightly packed together (Items 1 and 2).

All three of these expressions (the variance of $\hat\alpha$, the variance of $\hat\beta$, and the covariance between $\hat\alpha$ and $\hat\beta$) involve $\sigma^2$. Unfortunately, that's a population variance of the error term, which is usually invisible, making it hard to actually calculate the estimators' variances and covariances. But we can estimate this population error's variance, using the sample variance of our sample error, $\hat\varepsilon$. An unbiased estimator is the sample variance of our error terms:

$$s^2 = \frac{\sum \hat\varepsilon_i^2}{N - 2} = \frac{\sum [y_i - (\hat\alpha + \hat\beta x_i)]^2}{N - 2}, \qquad (5.20)$$

which is also sometimes called $\hat\sigma^2$.

(You might have expected that we'd divide by N-1, as we did when calculating the variance of a simple random variable. But in that case we were dividing by the number of degrees of freedom, which was N-1 because we had calculated one parameter estimate already, the sample mean $\bar{x}$, leaving us with N-1 free bits of information. In this case, before we could calculate the $\hat\varepsilon$ terms we had to calculate two parameter estimates, $\hat\alpha$ and $\hat\beta$, so we only have N-2 bits of free information (degrees of freedom) left.)

The square root of this estimated error-term variance is called the "standard error of the residuals" or "standard error of the regression," noted $s_{\hat\varepsilon}$ or $s$. As you might expect, if we swap this for the $\sigma$ term in the three variance and covariance expressions above, we achieve estimates of the variances and covariances of the regression coefficients $\hat\alpha$ and $\hat\beta$. If we then take the square root of the estimated variances, we have the standard errors of the regression coefficient estimates:

$$s_{\hat\beta} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}, \qquad s_{\hat\alpha} = s \sqrt{\frac{\sum x_i^2}{N \sum (x_i - \bar{x})^2}}$$

To summarize this sixth section of this chapter: We've learned how to estimate the standard error of the regression, $s_{\hat\varepsilon}$ (usually simply noted as $s$) and the standard errors of the parameter estimators, $s_{\hat\alpha}$ and $s_{\hat\beta}$. These are very useful because once you know something's standard error you can do hypothesis tests and construct confidence intervals. Of course, you'll probably never use these formulas to calculate these standard errors, because they are routinely calculated by statistical software. But now you know where these numbers come from.

One more loose end to tie off: Because $s_{\hat\alpha}$ and $s_{\hat\beta}$ give us a measure of the precision of our estimates of $\alpha$ and $\beta$, you might suppose that $s$ is giving us a measure of the precision of the whole regression line, all at once, by measuring how tightly the data are packed around our estimated line.
That's the right instinct, but it must be refined because $s$, as an estimated standard error, is sensitive to the units in which our variables are measured. But we can construct a closely-related statistic that's a more reliable measure of the goodness-of-fit between our data and our regression line:

Why are we doing a regression? In order to explain some of the changes in Y by relating them to changes in X. The maximum amount of squared deviations in Y that we could possibly explain would be all of them! Call that number the Total Sum of Squares, TSS:

$$TSS = \sum (y_i - \bar{y})^2 \qquad (5.23)$$

How well has our regression done at anticipating and explaining these variations in Y? Why not measure this by adding up our failures to explain, the squared distances of our data from our regression line, that is, the squared errors from our regression analysis. Call that number the Error Sum of Squares:

$$ESS = \sum [y_i - (\hat\alpha + \hat\beta x_i)]^2 \qquad (5.24)$$

The difference between TSS and ESS will be equal to the sum of squared deviations in y that our regression does correctly anticipate. These are the squared distances between the sample mean of y and the points on the estimated regression line directly above or below each observation, the Regression Sum of Squares:

$$RSS = \sum (\hat{y}_i - \bar{y})^2 = \sum [(\hat\alpha + \hat\beta x_i) - \bar{y}]^2 \qquad (5.25)$$

Look at the definitions of TSS and ESS. The first adds up deviations around our sample mean of y; the second adds up deviations around the numbers that our regression predicts to be the mean of y, given the level of x we've observed. The first measures total variation in y, the second measures total unexplained variation in y. So you could measure the percentage of the variation in y that our regression fails to explain with the ratio ESS/TSS.

We can restate this by calculating the percentage of variation in y that our regression does explain, the coefficient of determination, noted $R^2$:

$$R^2 = 1 - \frac{ESS}{TSS} = \frac{RSS}{TSS}, \qquad 0 \le R^2 \le 1$$

$R^2$ is a proportion, so it is unaffected by changes in the unit of measurement of our variables; it's a unit-free measure of the goodness of fit between our data and our regression line, because the numerator and denominator are measured in the same units. $R^2$ lies between 0 and 1 because we can't explain more than 100% of the variation in y, or less than 0%.

By the way, if you're straining to see how this measure of the regression's goodness of fit is related to $s$, which we set out to improve upon as a measure of fit, I can help:

$$R^2 = 1 - \frac{s^2 (N - 2)}{s_y^2 (N - 1)}$$

If you're stymied by the name "$R^2$," it's due to the fact that this statistic is equal to the square of the sample correlation (abbreviated r) between the observed values of y and our regression's predicted values for y (which are $\hat\alpha + \hat\beta x_i$, or $\hat{y}_i$).
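The precision and goodness-of-fit quantities of this section can be collected in one short routine. The Python sketch below is a minimal illustration, not a substitute for statistical software; the function name `regression_summary`, the dictionary keys, and the data are hypothetical. It computes the standard error of the regression, the two coefficient standard errors, the three sums of squares, and $R^2$.

```python
import math


def regression_summary(xs, ys):
    """OLS fit plus the standard errors and goodness-of-fit measures of Section 6."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)  # sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    fitted = [alpha_hat + beta_hat * x for x in xs]

    ess = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    tss = sum((y - y_bar) ** 2 for y in ys)              # total variation
    rss = sum((f - y_bar) ** 2 for f in fitted)          # explained variation
    s2 = ess / (n - 2)                                   # estimated error variance
    return {
        "alpha_hat": alpha_hat,
        "beta_hat": beta_hat,
        "s": math.sqrt(s2),
        "se_beta": math.sqrt(s2 / sxx),
        "se_alpha": math.sqrt(s2 * sum(x * x for x in xs) / (n * sxx)),
        "ess": ess,
        "tss": tss,
        "rss": rss,
        "r2": 1.0 - ess / tss,
    }


# Hypothetical data: prices and quantities demanded.
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
quantities = [9.1, 7.2, 4.8, 3.1, 0.9, 0.2]

out = regression_summary(prices, quantities)
for name, value in out.items():
    print(f"{name:>9}: {value:.4f}")
```

With these numbers in hand you can confirm that TSS = ESS + RSS and that both expressions for $R^2$ agree. Rescaling the quantities (say, from bushels to pecks) changes `s` but leaves `r2` untouched, which is the reason for preferring $R^2$ as a measure of fit.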
Though $R^2$ is widely used as a measure of goodness of fit, we'll see that it has some limitations. Two of them would become clear after staring at our derivation: You cannot use $R^2$ to compare the goodness of fit between two regressions if 1) one regression contains an intercept and the other does not, or if 2) the dependent variables of the two regressions are not the same (for example, if one uses y and the other uses ln(y)).

7. Hypothesis Testing

We've discussed four regression statistics that you might want to hypothesize about: $\alpha$, $\beta$, $\sigma^2$ and $R^2$. We'll defer tests concerning $R^2$ until the next chapter, when we can be more robust about it. The most common test asks whether $\beta$ is zero. That would indicate that there's no relationship between X and Y, and exploring this relationship was presumably our reason for doing the regression. For completeness I'll summarize the typical tests for all three regression parameters, then present one new twist on hypothesis testing. The tests rely on three premises (the proofs of which you'll find in graduate statistics texts):

$\hat\alpha$ and $\hat\beta$ are normally distributed (since they're derived from regression errors, which are presumed to be normally distributed).

$\hat\alpha$ and $\hat\beta$ are distributed independently of $s^2$.

$$\frac{(N - 2)\, s^2}{\sigma^2} \sim \chi^2_{N-2}$$

Test for $\beta$ (5.27):

$H_0$: $\beta = \beta_0$ (where $\beta_0$ is some number you supply, frequently zero)
$H_A$: $\beta \neq \beta_0$ (two-tailed test), or $H_A$: $\beta > \beta_0$ or $\beta < \beta_0$ (one-tailed tests)
Test statistic under $H_0$: $t = \dfrac{\hat\beta - \beta_0}{s_{\hat\beta}} \sim t_{N-2}$
Decision rule: Reject $H_0$ if $|t| > t_c$ (two-tailed test) or $t > t_c$ or $t < -t_c$ (one-tailed tests), where $t_c$ is a critical value determined by the level of significance.

In words: To test whether the regression slope is equal to a particular number ($\beta_0$), find how many standard deviations your estimate lies from that number. The greater this distance between estimate and $\beta_0$, the less believable $\beta_0$ becomes.

Test for $\alpha$ (5.28): Identical. Replace each $\beta$ in 5.27 with an $\alpha$, and you have it.

Test for $\sigma^2$ (5.29):

$H_0$: $\sigma^2 = \sigma_0^2$
$H_A$: $\sigma^2 > \sigma_0^2$
Test statistic under $H_0$: $S = \dfrac{(N - 2)\, s^2}{\sigma_0^2} \sim \chi^2_{N-2}$
Decision rule: Reject $H_0$ if $S > \chi^2_c$

And now the new twist: Recall that there's really no magic level of significance that's universally appropriate. A hypothesis test that simply chooses a level of significance and reports a success or failure leaves us vaguely unsatisfied: If the hypothesis failed, by how much? If it didn't satisfy your tolerance for significance, it might have satisfied mine. For that reason it's become common to report a p-value for each estimated coefficient, where the p-value equals the level of significance at which your null hypothesis would have just barely passed the hypothesis test.

Quick Example: Say that you've done a regression of quantity demanded (measured in bushels of cucumbers) on price, with the following results (standard errors of estimates lie below the parameter estimates):

Q P (5.3) (.5) N R. 78 N s 37. 8

$\hat\beta$, your slope estimate in a regression, is equal to -5.0, and the estimated standard error of $\hat\beta$ is 2.5. The sample size equals 22. To test the null hypothesis that $\beta = 0$ in the population against the alternative hypothesis that $\beta \neq 0$, your test statistic would be

$$t = \frac{-5.0 - 0}{2.5} = -2.0 \sim t_{20}.$$

$\hat\beta$ is 2.0 standard deviations away from the value presumed in your null hypothesis. 2.0 is greater than the .10-significance-test critical value (1.725, from STATA or a t-distribution table), so we'd reject the null hypothesis at a .10 level of significance. But 2.0 is less than the .05-significance-level critical value (2.086), so we'd (barely) retain the null hypothesis at a .05 level of significance. The p-value for this test is .059266, which gives us much more precise information than either of the other tests.

8. Forecasting

We often do a regression because we'd like to forecast values of the dependent variable. Say you'd done the regression reported in the last quick example because you wonder how many bushels of cucumbers will be bought at a price of $4 per bushel. Since $\hat\alpha$ and $\hat\beta$ are unbiased estimators of $\alpha$ and $\beta$, you'd get an unbiased estimate of $\hat{Q}$ (quantity, given that price equals 4.0) by just setting price equal to 4.0 in your regression equation and solving for the forecast level of Q. This yields a point estimate equal to $\hat\alpha + \hat\beta(4.0) = \hat\alpha - 5.0(4.0)$.

You'd naturally like some indication of the reliability of this point estimate. If we knew the standard error of this predictor, we could calculate a confidence interval. Let's say you'd like a 95% confidence interval for the quantity demanded at a price of 4.0. At this point we must be rather precise, because there are two different, closely-related confidence intervals that might interest us:

We could construct an interval that, with 95% certainty, captures the point on the regression line that lies above a price of 4.0.

We could construct an interval that captures 95% of the demand conditions we'll actually experience at a price of 4.0.
The first option gives us a range within which the average of our sales is likely to fall, a confidence interval for the mean predictor. We're 95% sure that, at a price equal to 4.0, the population regression line lies between these two points. If we constructed such a range for each possible level of price, we'd have constructed a space within which we're 95% sure that we've captured the true population regression line:

[Figure: confidence band around the estimated regression line]

(The interval gets wider as we move farther from our average observation of X, because we're forecasting farther from the core of the information we've gathered.)

The second option gives us a wider range within which the actual level of our sales is likely to fall, a confidence interval for point forecasts. This has to be a wider range, since actual sales vary around their average levels.

The standard error of the mean estimator (the first option) can be estimated by

$$s_{\hat{y}} = \sqrt{s^2 \left[ \frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]}.$$

I will spare you the proof. In words, we can forecast the mean value of y at any particular value $x_0$ of the independent variable; the standard error of that forecast is larger 1) if the estimated variance $s^2$ of our error term is larger, that is, if the data are widely dispersed around our regression line; 2) if our sample size, N, is small; 3) if we are trying to forecast far away from the average observation of x; and 4) if our observations of x are not very well dispersed.

Since we can compute the standard error of our forecasted mean, we can employ the usual logic to construct a confidence interval for the mean forecast:

$$\hat{y} \pm t_c\, s_{\hat{y}},$$

where $t_c$ is a critical t-statistic value that depends on our level of confidence. In words, we are c% sure that the forecasted mean lies within $t_c$ standard deviations of $\hat{y}$.
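The interval for the mean forecast can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `mean_forecast_interval` and the data are hypothetical, and the critical value 2.776 is the two-tailed 95% value for 4 degrees of freedom read from a t table (six observations minus two estimated parameters), supplied by hand rather than computed.

```python
import math


def mean_forecast_interval(xs, ys, x0, t_c):
    """Return (y_hat, se_mean, lower, upper) for the mean of y at x = x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    ess = sum((y - (alpha_hat + beta_hat * x)) ** 2 for x, y in zip(xs, ys))
    s2 = ess / (n - 2)  # estimated error variance
    y_hat = alpha_hat + beta_hat * x0
    # Standard error of the forecasted mean at x0.
    se_mean = math.sqrt(s2 * (1.0 / n + (x0 - x_bar) ** 2 / sxx))
    return y_hat, se_mean, y_hat - t_c * se_mean, y_hat + t_c * se_mean


# Hypothetical data: prices and quantities demanded.
prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
quantities = [9.1, 7.2, 4.8, 3.1, 0.9, 0.2]
T_C = 2.776  # assumed table value: two-tailed 95%, 4 degrees of freedom

for x0 in (3.5, 6.0, 9.0):
    y_hat, se, lo, hi = mean_forecast_interval(prices, quantities, x0, T_C)
    print(f"x0 = {x0:4.1f}   y_hat = {y_hat:7.3f}   se = {se:6.3f}   [{lo:7.3f}, {hi:7.3f}]")
```

The standard error is smallest at the average price (3.5 in this made-up sample) and grows as the forecast moves away from it, which is why the confidence band flares out at the edges.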

For point forecasts (the second option), the relevant standard error $s_y$ comes from

$$s_{y}^2 = s^2\left[\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} + 1\right],$$

where $x_0$ is the value of x at which we are forecasting. Compared to the previous standard error, there's one small difference: We've added in an extra $s^2$, because individual observations vary around their expected value, with a variance we've estimated to be $s^2$. By using this $s_y$ you can compute confidence intervals in the normal way, so that your confidence interval for a point forecast would be

$$\hat{y} \pm t_c\, s_{y}.$$

9. Comparing Forecasts

Imagine that two farmers from our market have developed slightly different regressions to forecast their sales. They might want a way to compare the approaches, to decide which makes more reliable forecasts. Call the forecasted value of the dependent variable $y^f$, and the actual value that eventually occurred $y$. Here are three typical scorecards you might compute after making several forecasts. (For all three, a low score is better than a high score.)

Mean Squared Error (MSE):

$$\text{MSE} = \frac{\sum (y^f - y)^2}{N}$$

Looks like a variance, doesn't it? Some prefer its square root, root mean squared error.

Mean Absolute Percent Error (MAPE), which only works if all y are positive:

$$\text{MAPE} = \frac{1}{N}\sum \left|\frac{y^f - y}{y}\right| \times 100$$

Mean Squared Percentage Error (MSPE):

$$\text{MSPE} = \frac{1}{N}\sum \left(\frac{y^f - y}{y}\right)^2 \times 100$$

Again, some prefer its square root, root mean squared percentage error.

Each approach has its champions and detractors, and your choice of a scorecard for forecasts probably should depend upon the situation. In some cases the amount of error is more important than the percentage error, in others not; in some cases you want to severely penalize large errors by squaring them, but sometimes this would be inappropriate.

If you're impatient and want to evaluate several forecasters without waiting for additional observations, all is not lost. You can estimate your regressions using only a percentage of the observations you have (say, 90%), use the regressions to make forecasts of y at the x values in your unused observations, then compute the MSE or MAPE or MSPE for these so-called post-sample forecasts.
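The post-sample idea can be sketched in a few lines of Python (not STATA). This is an illustration under stated assumptions, not a prescribed procedure: the ten observations are invented, the function names (`fit_line`, `mse`, `mape`, `mspe`, `post_sample_scores`) are hypothetical, the hold-out is simply the last observations in the list, and the percentage scorecards multiply the average by 100.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def mse(forecasts, actuals):
    """Mean squared error."""
    return sum((f - y) ** 2 for f, y in zip(forecasts, actuals)) / len(actuals)

def mape(forecasts, actuals):
    """Mean absolute percent error; assumes every actual value is positive."""
    return 100 * sum(abs((f - y) / y) for f, y in zip(forecasts, actuals)) / len(actuals)

def mspe(forecasts, actuals):
    """Mean squared percentage error; assumes no actual value is zero."""
    return 100 * sum(((f - y) / y) ** 2 for f, y in zip(forecasts, actuals)) / len(actuals)

def post_sample_scores(x, y, share=0.9):
    """Fit on the first `share` of the data, then score forecasts on the rest."""
    cut = round(share * len(x))
    a, b = fit_line(x[:cut], y[:cut])
    forecasts = [a + b * xi for xi in x[cut:]]
    actuals = y[cut:]
    m = mse(forecasts, actuals)
    return {"MSE": m, "RMSE": math.sqrt(m),
            "MAPE": mape(forecasts, actuals), "MSPE": mspe(forecasts, actuals)}

# Invented data: ten Saturdays of prices and bushels sold.
prices = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
bushels = [53.0, 50.0, 48.0, 44.0, 43.0, 40.0, 38.0, 34.0, 33.0, 30.0]

# Hold out the last 20% here so that more than one forecast gets scored.
for name, value in post_sample_scores(prices, bushels, share=0.8).items():
    print(f"{name}: {value:.3f}")
```

To compare two farmers' approaches, you would run each one's regression through the same hold-out observations and compare the resulting scores; which scorecard to trust is the judgment call discussed above.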
10. More Regressions to Come

Simple linear regression is powerful, but not powerful enough for a complicated world in which we can't run controlled experiments. In the next chapter we consider expanding our model to cases with more than one explanatory variable. In the succeeding chapters we will relax most of our eight simplifying assumptions, learning how to adjust our analysis when the assumptions are not met.

Useful STATA Commands:
Values you supply are in italics. Words you type literally are in boldface. Options are in [ ]; you don't type the brackets.

Entering data:
*From the keyboard, within STATA:
clear         /* to clear any existing data in memory */
input names   /* names: 1-8 characters, period for missing values; STATA is case-sensitive */
              /* enter observations one at a time, space between variables, starting next line */
end
              /* later new observations: input ... end */
              /* later new variables: input names */
*Outside STATA, you can either
*enter data in a text file (using something like WordPad), separating each variable by spaces and
*each observation by a carriage return, and save it as filename.asc, then use the Stata infile command:
infile variablelist using filename.extension
              /* string variables: infile strxx varname where 0<xx<81 and xx=string length */
              /* in file, strings go in " " marks if they contain blanks */
*or
*put your data into an Excel file, using the first row for variable names. Save it as a tab-delimited
*ASCII file, and use the Stata command
insheet using filename
*Stata will automatically read the variable names from your file.

Seeing data:
list
summarize
describe

Saving data:
save filename   /* may not include blanks; saved as filename.dta */
                /* use save filename, replace if overwriting an existing file */

Re-using saved Stata data:
use filename    /* save or clear current data in memory first if necessary */

Using non-Stata (ASCII) data files that are not in the same directory as STATA:
*For example, if a data file is in a common directory in the lab, take this approach:
*a. Find the data file you want to use, in the common drive. Drag a copy of it onto your desktop if you wish.
*b. Open the STATA program, and issue the first part of the INFILE command:
infile variablenames using
*Now, rather than trying to type out the address of the file correctly, just go up to the FILE menu,
*and choose the FILENAME option. You'll get a dialog box. Navigate to the desktop, and click on
*the name of your data file. This name will automatically appear in the command line you were
*typing, enclosed in " " marks. Now you can just tap your ENTER key, and the command should
*bring the data into the program for you.

Altering Variables:
generate newvariable = expression [if expression] [in range]
replace oldvariable = expression [if expression] [in range]
*For creating or altering parameters, use
scalar scalarname = expression   /* you can follow this with scalar list scalarname or scalar drop */
*Expressions can include functions: abs(x), exp(x), ln(x), log10(x), sqrt(x), or statistical functions.
*For correlations, use
correlate [variablelist] [weight] [if expression] [in range] [, means covariance]
*For pairwise correlations only, you can use
pwcorr [variablelist] [weight] [if expression] [in range]

Regression:
regress depvariable indepvars [weight] [if expression] [in range] [, noconstant]
/* Saved results include:
   e(N)      # observations
   e(df_m)   model degrees of freedom
   e(df_r)   residual degrees of freedom
   e(r2)     R-squared
   e(F)      regression F-statistic
   e(rmse)   root mean square error
   e(b)      coefficient vector
   e(V)      var-cov matrix of estimators */
*See these with estimates list or, for the matrixes, matrix list matrixname
predict newvariablename [if expression] [in range] [, statistic]
/* generates predicted values, where statistic can be
   pr(a,b)     probability a<y<b
   residuals   residuals
   rstandard   standardized residuals
   stdp        standard error of the prediction
   stdf        standard error of forecast
   stdr        standard error of the residual */

Saving your program and/or results:
log using filename.do [, noproc append replace]
   /* saves your file as a do file, which can be edited and run again */
   /* the noproc option saves only what you type, not any output */
   /* append will append your work to the end of an existing file */
   /* replace will overwrite an existing file */
log off     /* to suspend logging */
log on      /* to resume logging */
log close   /* to quit logging for good */

To quit using STATA:
exit, clear   /* but be sure you've saved your data first, if you need to */

Re-running your .do files:
*Just type
do filename.do   /* where filename.do is the .do file containing the commands you want to run */