Fiscal Studies (2000) vol. 21, no. 4, pp. 427–468

Evaluation Methods for Non-Experimental Data

RICHARD BLUNDELL and MONICA COSTA DIAS*

Abstract

This paper presents a review of non-experimental methods for the evaluation of social programmes. We consider matching and selection methods and analyse each for cross-section, repeated cross-section and longitudinal data. The methods are assessed drawing on evidence from labour market programmes in the UK and in the US.

JEL classification: J38, H3, C2.

* University College London and Institute for Fiscal Studies. The authors thank Costas Meghir, Barbara Sianesi, John Van Reenen, the editor and a referee for helpful comments. This review was originally prepared for the Department for Education and Employment and is also part of the programme of research of the Economic and Social Research Council (ESRC) Centre for the Microeconomic Analysis of Fiscal Policy at the Institute for Fiscal Studies. Co-funding from the Leverhulme Trust is gratefully acknowledged. The second author acknowledges financial support from Sub-Programa Ciência e Tecnologia do Segundo Quadro Comunitário de Apoio, grant number PRAXIS XXI/BD/11413/97. The usual disclaimer applies.

Institute for Fiscal Studies, 2000

I. AN OVERVIEW OF THE EVALUATION PROBLEM

The evaluation problem of concern here is the measurement of the impact of a policy reform or intervention (for example, a childcare subsidy or a targeted training programme) on a set of well-defined outcome variables. For the former example intervention, the outcome variables might include the child's exam results or the mother's labour market participation, while, for the latter, they could include individual employment durations, earnings and/or unemployment durations. Usually, individuals are identified by some observable type, for example, gender, age, education, location or marital status. The evaluation problem, therefore, is to measure the impact of the programme on
each type of individual. It can be regarded as a missing-data problem since, at a moment in time, each person is either in the programme under consideration or not, but not both. If we could observe the outcome variable for those in the programme had they not participated, there would be no evaluation problem. Thus constructing the counterfactual is the central issue that evaluation methods address.

There are many references in the literature that document the development of the analysis of the evaluation problem. In the labour market area, from which we draw heavily in this review, the original papers that use longitudinal data are those by Ashenfelter (1978), Ashenfelter and Card (1985) and Heckman and Robb (1985 and 1986).

Evaluation methods in empirical economics fall into five broad and related categories. Implicitly, each provides an alternative approach to constructing the counterfactual. The first is the pure randomised social experiment. In many ways, this is the most convincing method of evaluation since there is a control (or comparison) group which is a randomised subset of the eligible population. The literature on the advantages of experimental data was developed in papers by Bassi (1983 and 1984) and Hausman and Wise (1985) which were based on earlier statistical experimental developments (see Cochrane and Rubin (1973) and Fisher (1951), for example). A properly defined social experiment can overcome the missing-data problem. For example, in the design of the study of the Canadian Self-Sufficiency Project reported in Card and Robins (1998), the labour supply responses of approximately 6,000 single mothers in British Columbia to an in-work benefit programme, in which half those eligible were randomly excluded from the programme, were recorded. This study has produced invaluable evidence on the effectiveness of financial incentives in inducing welfare recipients into work.

Of course, experiments have their own drawbacks. First, they are rare in economics and typically expensive to implement. Second, they are not amenable to extrapolation.
That is, they cannot easily be used in the ex ante analysis of policy reform proposals. Finally, they require the control group to be completely unaffected by the reform, typically ruling out spillover, substitution and equilibrium effects on wages etc. None the less, they have much to offer in enhancing our knowledge of the possible impact of policy reforms. Indeed, a comparison of results from non-experimental data with those obtained from experimental data can help assess appropriate methods where experimental data are not available. For example, the important studies by LaLonde (1986), Heckman, Ichimura and Todd (1997) and Heckman, Smith and Clements (1997) use experimental data to assess the reliability of comparison groups used in the evaluation of training programmes.

A second popular method of evaluation is the so-called natural experiment. This approach typically considers the policy reform itself as an experiment and tries to find a naturally occurring comparison group that can mimic the properties of the control group in the properly designed experimental context.
This method is also often labelled difference-in-differences since it is usually implemented by comparing the difference in average behaviour before and after the reform for the eligible group with the before and after contrast for the comparison group. Under certain conditions, this approach can be used to recover the average effect of the programme on those individuals who entered into the programme (or those individuals treated by the programme), thus measuring the average effect of the treatment on the treated. It does this by removing unobservable individual effects and common macro effects. However, it relies on the two critically important assumptions of common time effects across groups and no composition changes within each group.¹ Together, these assumptions make choosing a comparison group extremely difficult. For example, in their heavily cited evaluation study of the impact of Earned Income Tax Credit reforms on the employment of single mothers in the US, Eissa and Liebman (1996) use single women without children as a control group. However, this comparison can be criticised for not capturing differential macro effects. In particular, this control group is already working to a very high level of participation in the US labour market (around 95 per cent) and therefore cannot be expected to increase its level of participation in response to the economy coming out of a recession. In this case, all the expansion in labour market participation in the group of single women with children will be attributed to the reform itself.

A third approach is the matching method. This has a long history in non-experimental statistical evaluation (see the references in Heckman, Ichimura and Todd (1997)). The aim of matching is simple. It is to select enough observable factors that any two individuals with the same values of these factors will display no systematic differences in their reactions to the policy reform.
Consequently, if each individual undergoing the reform can be matched with an individual with the same matching variables who has not undergone the reform, the impact of the reform on individuals of that type can be measured. As in the choice of control group in a natural experiment, it is a matter of faith as to whether the appropriate matching variables have been chosen. If they have not, the counterfactual effect will not be correctly measured. Again, experimental data can help here in evaluating the choice of matching variables, and this is precisely the motivation for the Heckman, Ichimura and Todd (1997) study. As we document below, matching methods have been extensively refined in the recent evaluation literature and are now a valuable part of the evaluation toolbox.

¹ See Blundell, Duncan and Meghir (1998) for a precise description of these conditions.

The fourth approach is the selection model. Developed by Heckman (1979), it was fully integrated into the evaluation literature in Heckman and Robb (1985 and 1986). This approach relies on an exclusion restriction, which requires a variable that determines participation in the programme but not the outcome of the programme itself. In contrast to matching, which can be considered as
selection on the observables, the Heckman approach accounts for selection on the unobservables. A comparison of these two approaches turns out to be extremely informative in understanding the advantages and drawbacks of these methods.

The final approach is the structural simulation model. This approach is closely related to the selection model and has long been at the centre of tax reform evaluation where behaviour can often be reasonably modelled by some rational choice framework (see Blundell and MaCurdy (1999) for a review). It has the advantage of separating preferences from constraints and can therefore be used to simulate new policy reforms that change the constraint while leaving preferences unaffected. Moreover, this approach can feed into some overall general equilibrium evaluation. However, these models require a believable behavioural model for individuals, something the experimental and quasi-experimental approaches ignore by design.

Appropriate evaluation methods therefore depend on several overall criteria: (i) the nature of the programme, that is, whether it is local or national, small-scale or global; (ii) the nature of the question to be answered, that is, the overall impact, the effect of treatment on the treated or the extrapolation to a new policy reform; and (iii) the nature of the data available.

With regard to the nature of the data, there are a number of issues. Does the dataset contain information for individuals before and after their programme participation? Are similar questionnaires administered to potential comparison groups or are we to use other survey data to construct comparisons? In some studies, comparison groups are chosen in the same location and asked to respond to the same questionnaire as those in the programme. In other studies, a comparison group has to be drawn from individuals who are much less likely to be similar. This turns out to be critical in the implementation of matching methods, which we discuss in detail below.

This paper is organised as follows.
Our aim is to discuss evaluation methods when experimental data are not available. We outline the measurement problem in Section II and consider the types of data and their implication for the choice of evaluation method in Section III. Section IV is the main focus of this paper as it presents a detailed comparison of alternative methods of evaluation for non-experimental data. In Section V, we illustrate these methods drawing on recent applications in the evaluation literature. Section VI concludes.

II. WHAT ARE WE TRYING TO MEASURE?

An important decision to be made when evaluating the impact of a programme is whether to assume homogeneous or heterogeneous treatment effects. Typically, we do not expect all individuals to respond to a policy intervention in exactly the same way. That is, there will be heterogeneity in the impact across individuals. Consequently, there are two possible questions that evaluation methods attempt
to answer. The first is the measurement of the impact of the programme on individuals of a particular type as if they were assigned to such a programme randomly from the population of all people of that type. The second is the impact on individuals of a particular type among those who were assigned to the programme. Under the assumption of homogeneous treatment effects, these two measures are identical, but this is not so when treatment effects can vary. In this case, the latter measure is often referred to as the effect of treatment on the treated.

1. Homogeneous Treatment Effects

To make things more precise, suppose there is a policy reform or intervention for which we want to measure the impact on some outcome variable, $Y$. This outcome is assumed to depend on a set of exogenous variables, $X$, and on a dummy variable, $d$, such that $d_i = 1$ if individual $i$ has participated in the programme and $d_i = 0$ otherwise. For ease of exposition, we will assume that the programme takes place in period $k$, so that, in each period $t$,

$$
\begin{aligned}
Y_{it} &= X_{it}\beta + d_i\alpha + U_{it} && \text{if } t > k \\
Y_{it} &= X_{it}\beta + U_{it} && \text{if } t \le k,
\end{aligned}
\tag{1}
$$

where $\alpha$ measures the homogeneous impact of treatment for individual $i$.² The set of parameters $\beta$ in (1) defines the relationship between the exogenous variables $X$ and the dependent variable $Y$, and $U_{it}$ is the error term of mean zero, which is assumed to be uncorrelated with $X$.

Except in the case of experimental data, assignment to treatment is most probably not random. As a consequence, the assignment process is likely to lead to a non-zero correlation between enrolment in the programme, represented by $d_i$, and the error term in the outcome equation, $U_{it}$. This happens because an individual's participation decision is probably based on personal characteristics that may well affect the outcome $Y$ as well. If this is so, and if we are unable to control for all the characteristics affecting $Y$ and $d$ simultaneously, then some correlation between the error term, $U$, and the participation variable, $d$, is expected.
In such a case, the standard econometric approach, which would regress $Y$ on a set of regressors including $d$, is not valid.

² In most of what follows, we will assume a linear specification of the outcome equation. However, this is relaxed when dealing with non-parametric estimators, as in the case of the general matching estimator described in Section IV(4).

We assume that the participation decision can be parametrised in the following way. For each individual, there is an index, $IN$, depending on a set of
variables $Z$ and parameters $\gamma$, for which enrolment occurs when this index rises above zero. That is,

$$
IN_i = Z_i\gamma + V_i,
\tag{2}
$$

where $V_i$ is the error term and

$$
\begin{aligned}
d_i &= 1 && \text{if } IN_i > 0 \\
d_i &= 0 && \text{otherwise.}
\end{aligned}
\tag{3}
$$

2. Heterogeneous Treatment Effects

However, it seems reasonable to assume that the treatment impact varies across individuals. Naturally, these differentiated effects should also influence the decision process and so are likely to be correlated with the treatment indicator, $d_i$. Abstracting from other regressors, $X$, the outcome equation takes the form (when $t > k$)

$$
Y_{it} = \beta + d_i\alpha_i + U_{it},
\tag{4}
$$

where $\alpha_i$ is the treatment impact on individual $i$. Define $\alpha$ as the population mean impact, $\varepsilon_i$ as worker $i$'s deviation from the population mean and $\alpha_T$ as the mean impact of treatment on the treated. Thus

$$
\begin{aligned}
\alpha_i &= \alpha + \varepsilon_i \\
\alpha_T &= \alpha + E(\varepsilon_i \mid d_i = 1),
\end{aligned}
\tag{5}
$$

where $E(\varepsilon_i \mid d_i = 1)$ stands for the mean deviation of the impact among participants. The outcome regression equation may now be rewritten in the following way:

$$
Y_{it} = \beta + d_i\alpha + [U_{it} + d_i\varepsilon_i] = \beta + d_i\alpha + [U_{it} + d_i(\alpha_i - \alpha)].
\tag{6}
$$

Obviously, the additional problem with this heterogeneous specification of treatment effects concerns the form of the error term, $U_{it} + d_i(\alpha_i - \alpha)$. This can be seen to differ across observations according to the treatment status of each individual as $d_i$ assumes the values 0 and 1. The identification of the parameter $\alpha$ is more difficult in the case of non-zero correlation with the
treatment indicator. Notice that if $E(\varepsilon_i d_i) \neq 0$, we should have $E(\varepsilon_i \mid d_i) \neq 0$,³ and thus

$$
E(Y_{it} \mid d_i) = \beta + d_i[\alpha + E(\varepsilon_i \mid d_i)] + E(U_{it} \mid d_i).
\tag{7}
$$

In this case, the ordinary least squares (OLS) estimator identifies

$$
E(\hat{\alpha}) = \alpha + E(\varepsilon_i \mid d_i = 1) + E(U_{it} \mid d_i = 1) - E(U_{it} \mid d_i = 0).
\tag{8}
$$

Consequently, even if $U_{it}$ is uncorrelated with $d_i$, so that $E(U_{it} \mid d_i = 1) = E(U_{it} \mid d_i = 0) = 0$, an identification problem remains. It is clear from (8) that, without further assumptions or information, only the impact of treatment on the treated, $\alpha_T = \alpha + E(\varepsilon_i \mid d_i = 1)$, is identifiable. This is because, even if the error term, $U$, is uncorrelated with the decision process, the individual-specific component of the treatment effect, $\varepsilon_i$, is most likely not to be. We expect individuals to decide taking into account their own specific conditions, in which case $E(\varepsilon_i \mid d_i = 1) \neq 0$ and the identification of $\alpha$ becomes more difficult.

III. EXPERIMENTAL AND NON-EXPERIMENTAL DATA

1. Experimental Data

As mentioned above, experimental data provide the correct missing counterfactual, eliminating the evaluation problem. The contribution of experimental data is to rule out self-selection (according to observables or unobservables) as a source of bias. In fact, as individuals are randomly assigned to the programme, a decision process such as the one described in Section II is ruled out.

Let us suppose, for example, that an experiment is conducted and that a random sample from a group of eligible individuals is chosen to participate in a programme; these are administered the treatment. Within that target group, assignment to treatment is completely independent of a possible outcome variable, which is to say that it is independent of the treatment effect. If no side-effects exist, the comparison group composed of the non-treated is statistically equivalent to the treated group in all respects except treatment status.
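³ This is because, by iterated expectations,
$$
E(\varepsilon_i d_i) = E_d[E(\varepsilon_i d_i \mid d_i)] = E_d[d_i E(\varepsilon_i \mid d_i)] = \text{Prob}(d_i = 1)\,E(\varepsilon_i \mid d_i = 1)
$$
and, by construction,
$$
E(\varepsilon_i) = \text{Prob}(d_i = 1)\,E(\varepsilon_i \mid d_i = 1) + \text{Prob}(d_i = 0)\,E(\varepsilon_i \mid d_i = 0) = 0,
$$
which means that, in general, $E(\varepsilon_i \mid d_i) \neq 0$.

In the case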
of homogeneous treatment effects, where the $\alpha_i$ are the same for all $i$, the impact of treatment can be easily measured by a simple subtraction of mean outcomes:

$$
\hat{\alpha} = \bar{Y}_t^{(1)} - \bar{Y}_t^{(0)}, \qquad t > k,
\tag{9}
$$

where $\bar{Y}_t^{(1)}$ and $\bar{Y}_t^{(0)}$ are, respectively, the treated and non-treated mean outcomes at a time $t$ after the programme.

However, some factors associated with the experimental design may invalidate this ideal setting. It is likely that some drop-out occurs, especially among the experimental controls. If this process is not random, it will alter the fundamental characteristic of experimental data. An idea of the importance of this non-random selection may be obtained by comparing the observable characteristics of both the control and treatment groups. This comparison provides a check on random assignment, at least with respect to the observables. If the non-treated are offered other treatment programmes, further differentiating factors are introduced and the comparison of means in (9) is unable to identify the treatment effect. Finally, other factors may change the behaviour of experiment participants, such as the experiment itself when selecting treated and non-treated. This also invalidates the consistency of such an estimator in an experimental framework.

2. Non-Experimental Data

Despite the above comments, non-experimental data are even more difficult to deal with and require special care. Imagine a dataset composed of a treatment group from a given programme and a comparison group drawn from the population at large. Even when the choice of the comparison group obeys the strict comparability rules based on observable information, which is frequently quite hard or even impossible to guarantee, we cannot be sure about the absence of differences in unobservables that are related to programme participation. This is the econometric selection problem, as commonly defined. In this case, using the estimator (9) results in a fundamental non-identification problem.
Abstracting from other regressors in the outcome equation, for large samples the estimator identifies

$$
E(\hat{\alpha}) = \alpha + [E(U_{it} \mid d_i = 1) - E(U_{it} \mid d_i = 0)].
\tag{10}
$$

In the case where $E(U_{it} d_i) \neq 0$, unless the terms in the square brackets cancel out, this expectation will differ from $\alpha$. Thus alternative estimators are needed. This motivates the methods we will focus on below in Section IV: instrumental variables, selection, difference-in-differences and matching methods.
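The bias term in (10) can be illustrated with a small simulation. The sketch below is not taken from any of the studies discussed here: the parameter values, the normal errors, the decision rule $d_i = 1$ if $V_i > 0$ and all function names are assumptions made purely for illustration. It compares the difference in means (9) under self-selection with the same estimator under random assignment.

```python
import random


def mean(values):
    return sum(values) / len(values)


def difference_in_means(outcome, d):
    """Estimator (9): treated mean minus non-treated mean."""
    treated = [o for o, di in zip(outcome, d) if di == 1]
    untreated = [o for o, di in zip(outcome, d) if di == 0]
    return mean(treated) - mean(untreated)


random.seed(0)
n = 20000
alpha, beta = 2.0, 1.0  # hypothetical true impact and intercept

u, d_self, d_random = [], [], []
for _ in range(n):
    v = random.gauss(0.0, 1.0)                       # unobservable driving participation
    u.append(0.8 * v + 0.6 * random.gauss(0.0, 1.0))  # outcome error, correlated with v
    d_self.append(1 if v > 0 else 0)                 # self-selection on the unobservable
    d_random.append(1 if random.random() < 0.5 else 0)  # randomised assignment

y_self = [beta + alpha * di + ui for di, ui in zip(d_self, u)]
y_random = [beta + alpha * di + ui for di, ui in zip(d_random, u)]

naive = difference_in_means(y_self, d_self)            # biased, as in (10)
experimental = difference_in_means(y_random, d_random)  # close to alpha
selection_term = difference_in_means(u, d_self)         # the bracketed term in (10)

print("self-selected sample:", round(naive, 3))
print("randomised sample:   ", round(experimental, 3))
print("selection term:      ", round(selection_term, 3))
```

Under these assumptions the self-selected estimate exceeds the true impact by exactly the sample analogue of the bracketed term in (10), while the randomised estimate differs from the true impact only by sampling noise.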
3. An Example: The LaLonde Study

To highlight the distinction between experiments and non-experiments, we briefly consider the study by LaLonde (1986). This used an experimental dataset to compare experimentally and non-experimentally determined results, and to compare different types of non-experimental estimation methodologies. The programme the study is based on is called the National Supported Work Demonstration (NSWD). This was operated in 10 sites across the US and was designed to help disadvantaged workers, in particular women in receipt of AFDC (aid for families with dependent children), ex-drug-addicts, ex-criminal-offenders and high-school drop-outs. Qualified applicants were randomly assigned to treatment, which comprised a guaranteed job for nine to 18 months. Treatment and control groups totalled 6,616 individuals. Data on all participants were collected before, during and after treatment took place, and earnings were the chosen outcome measure.

To assess the reliability of the experimental design, pre-treatment earnings and other demographic variables for male treatments and controls are presented in Table 1 (see also LaLonde (1986)). It can be seen that there are no significant differences to be found between these two groups: they were statistically equivalent in terms of observables, at least at the start of the programme. In the absence of non-random drop-out and with no alternative treatment offered and no changes in behaviour induced by the experiment, the controls constitute the perfect counterfactual to estimate the treatment impact.

Table 2 shows the earnings evolution for treatments and controls from a pre-programme year (1975), through the treatment period (1976–77), until the post-programme period (1978).
TABLE 1
Comparison of Treatments and Controls: Characteristics for the NSWD Males

                                             Treatments   Controls
Age                                               24.49      23.99
Years of school                                   10.17      10.17
Proportion high-school drop-outs                   0.79       0.80
Proportion married                                 0.14       0.13
Proportion black                                   0.76       0.75
Proportion Hispanic                                0.12       0.14
Real earnings one year before treatment (a)       1,472      1,558
Real earnings two years before treatment (a)      2,860      3,030
Hours worked one year before treatment              278        274
Hours worked two years before treatment             458        469
Number of observations                            2,083      2,193

(a) Annual earnings in US dollars.

It can be seen that the treatments' and controls' earnings were nearly the same before treatment, diverged substantially during the
programme and converged somewhat after it. The estimated impact one year after treatment is almost +$900.

TABLE 2
Annual Earnings of Male Treatments and Controls

                          Treatments   Controls
1975                           3,066      3,027
1976                           4,035      2,121
1977                           6,335      3,403
1978                           5,976      5,090
Number of observations           297        425

Another interesting feature of the experimental data is the robustness to the choice of estimator. Table 3 (see Tables 5 and 6 of LaLonde (1986)) includes a set of estimates obtained using the control group and a number of other constructed comparison groups and based on different specifications that result in different estimation techniques.

TABLE 3
Estimated Treatment Effects for the NSWD Male Participants using the Control Group and Comparison Groups from the PSID and the CPS-SSA

Columns: (1) unadjusted difference of mean post-programme earnings; (2) adjusted difference of mean post-programme earnings; (3) unadjusted difference-in-differences; (4) adjusted difference-in-differences; (5) two-step estimator.

Comparison group       (1)      (2)      (3)      (4)      (5)
Controls               886      798      847      856      889
PSID 1              15,578    8,067      425      749      667
PSID 2               4,020    3,482      484      650
PSID 3                 697      509      242    1,325
CPS-SSA 1            8,870    4,416    1,714      195      213
CPS-SSA 2            4,095    1,675      226      488
CPS-SSA 3            1,300      224    1,637    1,388

Definitions:
PSID 1: all male household heads continuously in the period studied (1975–78) who were less than 55 years old and did not classify themselves as retired in 1975.
PSID 2: all men in PSID 1 not working when surveyed in the spring of 1976.
PSID 3: all men in PSID 1 not working when surveyed in either the spring of 1975 or the spring of 1976.
CPS-SSA 1: all males based on Westat's criterion except those over 55 years old.⁴
CPS-SSA 2: all males in CPS-SSA 1 who were not working when surveyed in March 1976.
CPS-SSA 3: all males in CPS-SSA 1 who were unemployed in 1976 and whose income in 1975 was below the poverty level.

⁴ Westat's criterion selects individuals who were in the labour force in March 1976 with nominal income less than $20,000 and household income less than $30,000.

The choice of the non-experimentally
determined comparison group is quite important, given the goal of reproducing the experimental setting as closely as possible. The aim is therefore to construct optimally a group of non-participants that closely reproduces what the participants would have been without the programme, which the group of controls is assumed to represent (experimental data). Given the observed characteristics, the comparison groups were drawn either from the Panel Study of Income Dynamics (those designated by PSID) or from the Current Population Survey, Social Security Administration (those designated by CPS-SSA).

Using comparisons from non-experimental control samples not only appears to change the results significantly but also raises the problem of dependence on the adopted specification for the earnings function and participation decision. We now turn to a general discussion of non-experimental methods in homogeneous and heterogeneous treatment effect models.

IV. METHODS FOR NON-EXPERIMENTAL DATA

The appropriate methodology for non-experimental data depends on three factors: the type of information available to the researcher, the underlying model and the parameter of interest. Datasets with longitudinal or repeated cross-section information support less restrictive estimators due to the relative richness of information. Not surprisingly, there is a clear trade-off between the available information and the restrictions needed to guarantee a reliable estimator.

Two estimators will be considered when only a single cross-section is available, namely, the instrumental variables (IV) and the two-step Heckman selection estimators. The IV method uses at least one variable that is related to the participation decision but otherwise unrelated to the outcome. It provides the required randomness in the assignment rule since the instrument is assumed to be in no way related to the outcome except through participation. Thus the relationship between the instrument and the outcome for different participation groups identifies the impact of treatment, avoiding selection problems.
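As a rough illustration of this idea in the homogeneous treatment effect case, the sketch below simulates a decision rule driven by an instrument $Z$ and an unobservable $V$ that is correlated with the outcome error. It then compares the OLS slope on $d$ with the textbook single-instrument ratio $\text{Cov}(Z, Y)/\text{Cov}(Z, d)$. The data-generating process, the parameter values and the use of this particular ratio form are assumptions chosen for illustration, not a procedure taken from the studies reviewed here.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


random.seed(1)
n = 20000
alpha, beta = 2.0, 1.0  # hypothetical true impact and intercept

z, d, y = [], [], []
for _ in range(n):
    zi = random.gauss(0.0, 1.0)                   # instrument: affects participation only
    vi = random.gauss(0.0, 1.0)                   # unobservable in the decision rule
    ui = 0.8 * vi + 0.6 * random.gauss(0.0, 1.0)  # outcome error, correlated with vi
    di = 1 if zi + vi > 0 else 0                  # d = 1 if the index Z + V is positive
    z.append(zi)
    d.append(di)
    y.append(beta + alpha * di + ui)

ols_alpha = covariance(d, y) / covariance(d, d)  # biased by selection on V
iv_alpha = covariance(z, y) / covariance(z, d)   # uses only variation induced by Z

print("OLS estimate:", round(ols_alpha, 3))
print("IV estimate: ", round(iv_alpha, 3))
```

In this simulated setting the instrument is valid by construction; with real data, the exclusion of the instrument from the outcome equation is an untestable assumption that has to be argued case by case.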
The Heckman selection estimator is a two-step method that uses an explicit model of the selection process to control for the part of the participation decision that is correlated with the error term in the outcome equation.

If the available data are in a longitudinal or repeated cross-section format, difference-in-differences (diff-in-diffs) can provide a more robust estimate of the impact of the treatment. We will outline the conditions necessary for diff-in-diffs to estimate the impact parameter of interest reliably. In particular, we will also suggest an extension to overcome the common trends assumption. This assumption, which is crucial for the consistency of the estimator, states that the treatment and comparison groups are affected in the same way by macro shocks. This, of course, is often difficult to justify for comparison groups chosen from non-experimental data.
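To make the diff-in-diffs calculation concrete, the short sketch below applies it to the mean earnings of NSWD male treatments and controls reported in Table 2. Taking 1975 as the pre-programme year and 1978 as the post-programme year follows the description of that table; the function name and the dictionary layout are illustrative choices only.

```python
# Mean annual earnings (US dollars) of NSWD males, as reported in Table 2.
treatments = {1975: 3066, 1978: 5976}
controls = {1975: 3027, 1978: 5090}


def diff_in_diffs(treat_pre, treat_post, comp_pre, comp_post):
    """Excess outcome growth of the treated over the comparison group."""
    return (treat_post - treat_pre) - (comp_post - comp_pre)


# Simple post-programme difference in means.
post_difference = treatments[1978] - controls[1978]

# Difference-in-differences between 1975 and 1978.
did = diff_in_diffs(treatments[1975], treatments[1978],
                    controls[1975], controls[1978])

print(post_difference)  # 886
print(did)              # 847
```

Both figures coincide with the unadjusted entries of the Controls row in Table 3 (886 and 847), as expected when the comparison group is the experimental control group.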
An alternative approach is the method of matching, which can be adopted with either cross-section or longitudinal data, although typically detailed individual information is required from before and after the programme for both the participant group and the non-participant comparison group. It will be shown that, with sufficiently detailed data, a simple propensity score method of matching can often produce quite reasonable results. Matching deals with the selection process by constructing a comparison group of individuals with observable characteristics similar to those of the treated. One way of doing this is to model the probability of participation, estimate its value for each individual (called the propensity score) and match individuals with similar propensity scores. As will be explained below, a non-parametric propensity score approach to matching that combines this method with diff-in-diffs has the potential to improve the quality of non-experimental evaluation results significantly.

For each estimator, we will discuss its ability to identify the treatment impact in a homogeneous and a heterogeneous environment, as well as other specific advantages and disadvantages. The cross-section methodologies are introduced in the first two subsections: first the IV estimator is presented and then the Heckman selection estimator (Heckman, 1979). Subsection 3 discusses the diff-in-diffs approach and potential extensions when the common macro trends restriction does not hold. In subsection 4, we present the standard matching method and extensions to more refined techniques, such as the use of propensity scores to match and the use of diff-in-diffs along with matching.

1. The Instrumental Variables (IV) Estimator

Consider, first, the homogeneous treatment effect case.
The IV method requires the existence of at least one regressor exclusive to the decision rule, $Z^*$, satisfying the following three conditions: first, $Z^*$ determines programme participation, that is, it has a non-zero coefficient in the decision rule; second, we can find a transformation, $g$, such that $g(Z^*)$ is uncorrelated with the error, $U$, given the exogenous variables, $X$; finally, $Z^*$ is not completely (or almost) determined by $X$. The variable(s) $Z^*$ is/are called the instrument(s), and it is a source of exogenous variation used to approximate randomised trials: it provides variation that is correlated with the participation decision but does not affect the potential outcomes from treatment directly.

Under the above conditions, the standard IV procedure may be applied, replacing the treatment indicator by $g(Z^*)$ and running a regression. An alternative is to use both $Z^*$ and $X$ to predict $d$, building a new variable, $\hat{d}$, which is used in the regression instead of $d$.

This is a very simple estimator but it suffers from two main drawbacks. The first concerns the instrument choice. In the treatment evaluation problem, it is not easy to think of a variable that satisfies all the three assumptions required to
identify $\alpha$. The difficulty lies, mainly, in the simultaneous requirements of participation determination and non-influence on the outcome of participation. A commonly proposed solution, possible when longitudinal or past data are available, is to consider lagged values of some determinant variables. However, they are likely to be strongly correlated with future values, included in the outcome regression, and hence this is unlikely to solve the problem.

The second issue becomes clear when trying to evaluate the impact of training in a heterogeneous framework. To understand why, recall that, from (6), the error term is given by

$$
U_{it} + d_i\varepsilon_i = U_{it} + d_i(\alpha_i - \alpha).
\tag{11}
$$

It is now evident that, even if $Z_i^*$ is uncorrelated with $U_{it}$, the same is not true with respect to $U_{it} + d_i(\alpha_i - \alpha)$ because $Z_i^*$ determines $d_i$ by assumption. The violation of this fundamental hypothesis invalidates the application of the IV methodology in a heterogeneous framework.

2. The Heckman Selection Estimator

This method is more robust than the IV estimator but also more demanding on assumptions about the structure of the model. As above, the simpler homogeneous treatment effect case will be considered first. The main assumption required to guarantee reliable estimates of the treatment effect is the existence of at least one additional regressor in the decision rule. This regressor is required to have a non-zero coefficient in the decision rule equation and to be independent of the error term, $V$. Moreover, knowledge of, or ability to estimate consistently, the joint density of the distribution of the errors $U_{it}$ and $V_i$ ($h(U_{it}, V_i)$, say) is required.

The rationale of this estimator is to control directly for the part of the error term in the outcome equation that is correlated with the participation dummy variable. The procedure uses two steps. In the first, the part of the error term $U_{it}$ that is correlated with $d_i$ is estimated. It is then included in the outcome equation and the effect of the programme is estimated in a second step.
Of course, by construction, what remains of the error term in the outcome equation is not correlated with the participation decision. Take, for example, the popular special case where $U_{it}$ and $V_i$ are assumed to follow a joint normal distribution. Adopting the standardisation $\sigma_V = 1$, we may now write the conditional outcome expectation as
$$
\begin{aligned}
E(Y_{it} \mid d_i = 1) &= \beta + \alpha + \rho\,\frac{\phi(Z_i\gamma)}{\Phi(Z_i\gamma)} \\
E(Y_{it} \mid d_i = 0) &= \beta - \rho\,\frac{\phi(Z_i\gamma)}{1 - \Phi(Z_i\gamma)},
\end{aligned}
\tag{12}
$$

where the last term on the right-hand side of each equation represents the expected value of the error term conditional on the participation variable, $d_i$. This is precisely what is missing from (1) when assignment to treatment is non-random, as described in subsection II(1). This new regressor deals with the part of the error term that is correlated with the decision process. By including it in the outcome equation, we are able to separate the true impact of treatment from the selection process, which accounts for the differences between participants and non-participants. Thus it is possible to estimate $\alpha$, the Heckman selection estimator for the selection model, by replacing $\gamma$ with $\hat{\gamma}$ (obtained from regressing $IN$ on $Z$; in practice, since $IN$ is not observed, from a probit of $d$ on $Z$) and running a least squares regression on (12).⁵

⁵ For a more detailed description of this estimator, see the Appendix.

(a) The Heckman Selection Estimator: Choice-Based Samples

One advantage of the two-step procedure in the homogeneous treatment effect case relates to its robustness to choice-based sampling. This is the kind of non-randomness obtained when drawing the comparison group (non-treated) from the population. Usually, the sample proportion of treated ($p_t^*$) differs from the population one ($p_t$). The treated are likely to be over-represented in the sample, resulting in a non-zero expectation of the outcome error term:

$$
p_t^*\,E(U_{it} \mid d_i = 1) + (1 - p_t^*)\,E(U_{it} \mid d_i = 0) \neq 0.
\tag{13}
$$

Robustness is achieved by controlling for the part of $U_{it}$ that is correlated with $d_i$. In fact, since the remaining error is orthogonal to $d_i$, it is unaffected by this type of stratification.

(b) The Heckman Selection Estimator: Heterogeneous Treatment Effects

Now suppose that the treatment impact differs across agents. The outcome equation becomes

$$
Y_{it} = \beta + \alpha_T d_i + \{U_{it} + d_i[\varepsilon_i - E(\varepsilon_i \mid d_i = 1)]\} = \beta + \alpha_T d_i + \xi_{it}.
\tag{14}
$$
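The two-step procedure requires knowledge of the joint density of $U_{it}$, $V_i$ and $\varepsilon_i$. Continuing to assume a joint normal distribution ($\sigma_V = 1$),

$$
\begin{aligned}
E(\xi_{it} \mid d_i = 1) &= \mathrm{Corr}(U_{it} + \varepsilon_i, V_i)\,\mathrm{Var}(U_{it} + \varepsilon_i)^{1/2}\,\frac{\phi(Z_i\gamma)}{\Phi(Z_i\gamma)} = \rho_{(U,V,\varepsilon)}\,\frac{\phi(Z_i\gamma)}{\Phi(Z_i\gamma)} \\
E(\xi_{it} \mid d_i = 0) &= -\,\mathrm{Corr}(U_{it}, V_i)\,\mathrm{Var}(U_{it})^{1/2}\,\frac{\phi(Z_i\gamma)}{1 - \Phi(Z_i\gamma)} = -\rho_{(U,V)}\,\frac{\phi(Z_i\gamma)}{1 - \Phi(Z_i\gamma)}.
\end{aligned}
\tag{15}
$$

Hence the outcome regression equation is

$$
Y_{it} = \beta + \alpha_T d_i + d_i\,\rho_{(U,V,\varepsilon)}\,\frac{\phi(Z_i\hat{\gamma})}{\Phi(Z_i\hat{\gamma})} - (1 - d_i)\,\rho_{(U,V)}\,\frac{\phi(Z_i\hat{\gamma})}{1 - \Phi(Z_i\hat{\gamma})} + \delta_{it},
\tag{16}
$$

which consequently identifies $\alpha_T$. However, this method is unable to identify $\alpha$, the effect of training if individuals were randomly assigned to treatment. In fact, if $\alpha$ is the parameter of interest, the appropriate equation is

$$
Y_{it} = \beta + \alpha d_i + (U_{it} + d_i\varepsilon_i) = \beta + \alpha d_i + \eta_{it}.
\tag{17}
$$

Notice that the error term for the treated no longer has a zero expectation. Formally,

$$
\begin{aligned}
E(\eta_{it} \mid d_i = 1) &= E(U_{it} + d_i\varepsilon_i \mid d_i = 1) = E(\varepsilon_i \mid d_i = 1) + \rho_{(U,V,\varepsilon)}\,\frac{\phi(Z_i\gamma)}{\Phi(Z_i\gamma)} \\
E(\eta_{it} \mid d_i = 0) &= E(U_{it} \mid d_i = 0) = -\rho_{(U,V)}\,\frac{\phi(Z_i\gamma)}{1 - \Phi(Z_i\gamma)}.
\end{aligned}
\tag{18}
$$

Therefore the outcome equation is given by

$$
Y_{it} = \beta + d_i\left[\alpha + E(\varepsilon_i \mid d_i = 1) + \rho_{(U,V,\varepsilon)}\,\frac{\phi(Z_i\hat{\gamma})}{\Phi(Z_i\hat{\gamma})}\right] - (1 - d_i)\,\rho_{(U,V)}\,\frac{\phi(Z_i\hat{\gamma})}{1 - \Phi(Z_i\hat{\gamma})} + \delta_{it},
\tag{19}
$$

which is exactly the same equation as the one obtained when trying to estimate $\alpha_T$. That is, only the treatment-on-the-treated impact is identifiable.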
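The second step of the two-step procedure in the homogeneous case (12) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the first-step fitted index $Z_i\hat{\gamma}$ is taken as given (the probit step is not shown), and the function names `selection_term`, `solve` and `second_step` are hypothetical. Only the standard library is used.

```python
import math


def norm_pdf(x):
    """Standard normal density, phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selection_term(index, d):
    """Selection regressor from (12): phi/Phi if d = 1, -phi/(1 - Phi) if d = 0."""
    if d == 1:
        return norm_pdf(index) / norm_cdf(index)
    return -norm_pdf(index) / (1.0 - norm_cdf(index))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def second_step(y, d, index):
    """Least squares of y on a constant, d and the selection term.

    `index` holds the first-step fitted values Z_i * gamma_hat (assumed given).
    Returns the tuple (beta, alpha, rho).
    """
    rows = [[1.0, float(di), selection_term(zi, di)] for di, zi in zip(d, index)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return tuple(solve(xtx, xty))
```

With heterogeneous treatment effects, as in (16), the selection term would instead enter with separate coefficients for the treated and the non-treated, and the coefficient on $d_i$ would be read as $\alpha_T$ rather than $\alpha$.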
3. The Difference-in-Differences (Diff-in-Diffs) Estimator

If longitudinal or repeated cross-section information is available, it is possible to estimate the treatment effect consistently without having to impose such restrictive conditions. To apply the diff-in-diffs estimator, at least one pre-programme set and one post-programme set of observations are required. Let $t_0$ and $t_1$ denote the pre- and post-programme periods for which data are available. The diff-in-diffs estimator measures the excess outcome growth for the treated compared with the non-treated. Formally, abstracting from other regressors besides the treatment indicator,

$$
\hat{\alpha}_{DID} = (\bar{Y}^T_{t_1} - \bar{Y}^T_{t_0}) - (\bar{Y}^C_{t_1} - \bar{Y}^C_{t_0}),
\tag{20}
$$

where $\bar{Y}^T$ and $\bar{Y}^C$ are the mean outcomes for the treatment and comparison (non-treatment) groups, respectively.

(a) The Diff-in-Diffs Estimator: Heterogeneous Treatment Effects

Where the impact of treatment is heterogeneous, provided the above conditions are verified, the diff-in-diffs estimator recovers the impact of the treatment on the treated:

$$
\begin{aligned}
E(\hat{\alpha}_{DID}) &= [\beta + \alpha_T + E(U_{it_1} \mid d_i = 1) - \beta - E(U_{it_0} \mid d_i = 1)] \\
&\quad - [\beta + E(U_{it_1} \mid d_i = 0) - \beta - E(U_{it_0} \mid d_i = 0)] = \alpha_T.
\end{aligned}
\tag{21}
$$

That is, the effect of treatment on the treated is identifiable, but not the population impact. Intuitively, this happens because the unobserved component of the treatment impact enters in the model as a temporary individual-specific effect that determines participation.

(b) The Diff-in-Diffs Estimator: The Common Trends and Time-Invariant Composition Assumptions

In contrast to the IV and Heckman selection estimators, no exclusion restrictions appear to be required for the diff-in-diffs estimator. In fact, there is no need for any regressor in the decision rule. Even the outcome equation does not have to be specified as long as the treatment impact enters additively. However, strong restrictions on common trends and error composition are implicit, which we now describe. Consider the following decomposition of the unobservables, $U_{it}$:

$$
U_{it} = \phi_i + \theta_t + \mu_{it},
\tag{22}
$$
where $\phi_i$ is an individual-specific effect, constant over time, $\theta_t$ is a common macroeconomic effect, the same for all agents, and $\mu_{it}$ is a temporary individual-specific effect. Notice that, if the expectation of $U_{it}$ conditional on the treatment status depends on the temporary individual-specific effect, $\mu_{it}$, diff-in-diffs is inconsistent. This estimator is, however, able to control for the other two error-term components as they cancel out on subtraction. As is straightforward to verify, a separability condition between individual and temporal effects has to be assumed:

$$
E(U_{it} \mid d_i) = E(\phi_i \mid d_i) + \theta_t.
\tag{23}
$$

Even simpler than this estimator, a simple difference method could be applied if the only unobservable term is $\phi_i$, the constant individual-specific effect. The estimator

$$
\hat{\alpha}_D = (\bar{Y}^T_{t_1} - \bar{Y}^T_{t_0})
\tag{24}
$$

would suffice to identify $\alpha$ consistently.

There are two main weaknesses of the diff-in-diffs approach. The first relates to the lack of control for unobserved temporary individual-specific components that influence the participation decision. In fact, the following can be written:

$$
E(\hat{\alpha}_{DID}) = \alpha + E(\mu_{it_1} - \mu_{it_0} \mid d_i = 1) - E(\mu_{it_1} - \mu_{it_0} \mid d_i = 0).
\tag{25}
$$

To illustrate the conditions under which such inconsistency might arise, suppose we are interested in evaluating a training programme in which enrolment is more prone to happen if a temporary dip in earnings occurs just before the programme takes place (so-called Ashenfelter's dip; see Heckman and Smith (1994)). A faster earnings growth is expected to occur among the treated, even without programme participation. Thus the diff-in-diffs estimator is likely to overestimate the impact of treatment. Also, if only repeated cross-section data are available, it may be difficult to control for the before-after comparability of the groups under this type of selection into the programme.
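The estimator in (20) and the bias term in (25) can be made concrete with a small numerical sketch. The earnings figures below are invented purely for illustration, and the function name `diff_in_diffs` is hypothetical; the true impact is set to 5 and the common macro trend to 2.

```python
from statistics import mean


def diff_in_diffs(treated_pre, treated_post, comparison_pre, comparison_post):
    """Estimator (20): excess mean outcome growth of the treated over the comparisons."""
    treated_growth = mean(treated_post) - mean(treated_pre)
    comparison_growth = mean(comparison_post) - mean(comparison_pre)
    return treated_growth - comparison_growth


# Illustrative earnings (hypothetical numbers).
comparison_pre = [99, 101, 103]
comparison_post = [101, 103, 105]      # common trend of +2
treated_post = [105, 107, 109]         # trend of +2 plus a true impact of +5

# Case 1: no transitory shock before the programme.
treated_pre_no_dip = [98, 100, 102]

# Case 2: a temporary dip of 10 just before enrolment (Ashenfelter's dip).
treated_pre_dip = [88, 90, 92]

estimate_no_dip = diff_in_diffs(treated_pre_no_dip, treated_post,
                                comparison_pre, comparison_post)
estimate_dip = diff_in_diffs(treated_pre_dip, treated_post,
                             comparison_pre, comparison_post)
```

In the first case the estimator returns the true impact of 5. In the second it returns 15: the recovery from the transitory dip, which corresponds to the term $E(\mu_{it_1} - \mu_{it_0} \mid d_i = 1) = 10$ in (25), is wrongly attributed to the programme.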
With repeated cross-section data and individuals selecting into the programme according to some unknown rule, the assumption that $E(\phi_i \mid d_i)$ is constant over time for each group may be too strong, because the composition of the groups may change over time and be affected by the intervention.

The second weakness occurs if the macro effect has a differential impact across the two groups. This happens when the treatment and comparison groups have some (possibly unknown) characteristics that distinguish them and make them react differently to common macro shocks. This motivates the differential-trend-adjusted diff-in-diffs estimator that is presented below.

(c) The Diff-in-Diffs Estimator: Adjusting for Differential Trends

Suppose that the comparison group and target group actually satisfy

$$
E(U_{it} \mid d_i) = E(\phi_i \mid d_i) + k_g\theta_t,
\tag{26}
$$

where the $k_g$ acknowledges the differential macro effect across the two groups. Now it can be seen that the diff-in-diffs estimator identifies

$$
E(\hat{\alpha}_{DID}) = \alpha + (k_T - k_C)(\theta_{t_1} - \theta_{t_0}),
\tag{27}
$$

where $T$ and $C$ refer to the treatment and control groups, respectively. This clearly only recovers the true effect of the programme when $k_T = k_C$.

Now suppose we take another time interval, $t_*$ to $t_{**}$, over which a similar macro trend has occurred. Precisely, we require a period for which the macro trend matches the term $(k_T - k_C)(\theta_{t_1} - \theta_{t_0})$ in (27). It is likely that the most recent cycle is the most appropriate, earlier cycles possibly having systematically different effects across the target and comparison groups. The differentially adjusted estimator proposed by Bell, Blundell and Van Reenen (1999), which takes the form

$$
\hat{\alpha}_{TADID} = [(\bar{Y}^T_{t_1} - \bar{Y}^T_{t_0}) - (\bar{Y}^C_{t_1} - \bar{Y}^C_{t_0})] - [(\bar{Y}^T_{t_{**}} - \bar{Y}^T_{t_*}) - (\bar{Y}^C_{t_{**}} - \bar{Y}^C_{t_*})],
\tag{28}
$$

will now consistently estimate $\alpha$.

4. The Matching Estimator

The matching method is a non-parametric approach to the problem of identifying the treatment impact on outcomes. It is more general in the sense that no particular specification has to be assumed. Moreover, it can be combined with other methods, producing more accurate estimates and allowing for less restrictive assumptions. However, it too rests on strong assumptions and particularly heavy data requirements.

The main purpose of matching is to re-establish the conditions of an experiment when no such data are available. As discussed earlier, with total random assignment within one group, one could compare the treated and the non-treated directly, without having to impose any structure on the problem.
With the matching method, the construction of a correct sample counterpart for the missing information on the treated outcomes had they not been treated consists in pairing each programme participant with members of a comparison group (non-treated). Under the matching assumption, the only remaining difference between the two groups is programme participation.

(a) The Matching Estimator: General Method

To illustrate the matching solution in a more formal way, consider a general specification of the outcome function,

$$
\begin{aligned}
Y^T &= g^T(X) + U^T \\
Y^C &= g^C(X) + U^C,
\end{aligned}
\tag{29}
$$

where $Y^T$ and $Y^C$ are the outcomes of the treated and the non-treated (comparison group), which can be written as a function of the set of observables, $X$, plus the unobservable term, $U^T$ or $U^C$. Note that we allow for different outcome functions according to the participation decision. As above, the most common goal of evaluation is to identify the impact of the treatment on the treated:

$$
\alpha_T = E(Y^T - Y^C \mid X, d = 1).
\tag{30}
$$

The solution advanced by matching is based on a fundamental assumption of conditional independence between non-treated outcomes and programme participation:

$$
Y^C \perp d \mid X.
\tag{31}
$$

This assumption states that the outcomes of the non-treated are independent of the participation status, $d$, once one controls for the observable variables, $X$. That is, given $X$, the non-treated outcomes are what the treated outcomes would have been had they not been treated or, in other words, selection occurs only on observables.$^6$

$^6$ Rosenbaum and Rubin, 1985; Rubin, 1979.

For each treated observation, $Y^T$, we can look for a non-treated (set of) observation(s), $Y^C$, with the same $X$-realisation. With the matching assumption, this $Y^C$ constitutes the required counterfactual. Actually, this is a process of rebuilding an experimental dataset which, in general, places strong requirements on data collection.

Additionally, matching also assumes that $0 < \mathrm{Prob}(d = 1 \mid X) < 1$ in order to guarantee that all treated agents have a counterpart in the non-treated population, and that anyone constitutes a possible participant. However, this does not ensure
that the same happens within any sample, and it is, in fact, a strong assumption when programmes are directed to tightly specified groups.

Let $S$ be the set of all possible values the vector of explanatory variables, $X$, may assume. It is called the support of $X$. Let $S^*$ be the common support of $X$, or the space of $X$ that is simultaneously observed among participants and non-participants for the specific dataset being used. Assuming the above described conditions, a subset of comparable observations is formed from the original sample, and with those a consistent estimator for the treatment impact on the treated, $\alpha_T$, is the empirical counterpart of

$$
\frac{\int_{S^*} E(Y^T - Y^C \mid X, d = 1)\,\mathrm{d}F(X \mid d = 1)}{\int_{S^*} \mathrm{d}F(X \mid d = 1)}.
\tag{32}
$$

The numerator of the above expression represents the expected gain from the programme among the subset of participants who are sampled and for whom one can find a comparable non-participant (that is, over $S^*$). To obtain a measure of the impact of the treatment on the treated, individual gains must be integrated over the distribution of observables among participants and re-scaled by the measure of the common support, $S^*$. The fraction therefore represents the expected value of the programme effect in the common support of $X$, $S^*$. It is simply the mean difference in outcomes over the common support, appropriately weighted by the distribution of participants. If the second assumption is fulfilled and the two populations are large enough, the common support is the entire support of both.

As should now be clear, the matching method avoids specifying a particular form for the outcome equation, decision process or either unobservable term. We simply need to ensure that, given the right observables, $X$, the observations of non-participants are statistically what the observations of the treated would be had they not participated.
Under a slightly different perspective, it might be said that we are decomposing the treatment effect in the following way:

$$
\begin{aligned}
E(Y^T - Y^C \mid X, d = 1) &= [E(Y^T \mid X, d = 1) - E(Y^C \mid X, d = 0)] \\
&\quad - [E(Y^C \mid X, d = 1) - E(Y^C \mid X, d = 0)],
\end{aligned}
\tag{33}
$$

the latter right-hand-side term being the bias conditional on $X$, which is assumed to be zero. The technique is to replace the unobserved outcomes of the participants had they not been treated with the outcomes of non-participants with the same $X$-characteristics.
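When $X$ takes a small number of discrete values, the empirical counterpart of (32) reduces to a weighted average of within-cell mean differences, with weights given by the number of participants in each cell. The sketch below illustrates this under that simplifying assumption; the record layout `(x, d, y)` and the function name `matching_on_cells` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def matching_on_cells(records):
    """Exact matching on a discrete X over the common support.

    `records` is an iterable of (x, d, y) tuples, with d = 1 for participants.
    Returns (estimate, common_support), where the estimate averages the
    within-cell mean differences using the participants' distribution of X.
    """
    treated = defaultdict(list)
    comparison = defaultdict(list)
    for x, d, y in records:
        (treated if d == 1 else comparison)[x].append(y)

    # Common support: values of X observed among both groups.
    common_support = sorted(x for x in treated if x in comparison)
    n_matched = sum(len(treated[x]) for x in common_support)
    if n_matched == 0:
        raise ValueError("no participant has a comparable non-participant")

    total_gain = sum(
        len(treated[x]) * (mean(treated[x]) - mean(comparison[x]))
        for x in common_support
    )
    return total_gain / n_matched, common_support


# Hypothetical data: X = 3 is observed only among participants.
records = [
    (1, 1, 12), (1, 1, 13), (1, 1, 14), (1, 0, 10),
    (2, 1, 20), (2, 0, 15), (2, 0, 17),
    (3, 1, 30),
]
estimate, support = matching_on_cells(records)
```

Here the participant with $X = 3$ has no counterpart and is dropped, so the estimate refers only to the common support $\{1, 2\}$: the cell gains of 3 and 4 are weighted by 3 and 1 participants, giving 3.25.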
(b) The Matching Estimator: The Role of the Participation Decision

Up to now, we have been differentiating individuals based on participation. However, the structural difference should rely on the participation decision. The participation decision, though, is not observable among non-participants. These form a mixture of those who, if offered the programme, would have decided to participate and those who would have decided not to. All the participants, however, were willing to be treated when the programme was offered to them. In such case, the outcome equations would be

$$
\begin{aligned}
Y^T &= g^T(X) + U^T \\
Y^C &= g^C(X) + [d^D U^C_1 + (1 - d^D) U^C_0],
\end{aligned}
\tag{34}
$$

where $d^D$ is a dummy variable standing for participation decision and $U^C_1$ and $U^C_0$ are the outcome error terms for non-participants who would and would not be willing to participate, respectively. The parameter of interest, the mean impact of treatment, is

$$
E(Y^T - Y^C \mid X, d = 1) = g^T(X) - g^C(X) + E(U^T - U^C_1 \mid X, d^D = 1).
\tag{35}
$$

Therefore there are two possibilities underlying matching assumptions: $\mathrm{Prob}(d^D = 1 \mid X) = 1 \;\forall X \in S^*$ or $E(U^C_0 \mid X) = E(U^C_1 \mid X) \;\forall X \in S^*$. The first hypothesis states that $X$ completely determines the participation decision: anyone characterised by a value of $X$ on the common support, $S^*$, would be willing to participate if offered the programme. This is the desired outcome if one is willing to reconstruct an experimental setting, since it states that $X$ is enough to build up a comparison group with the desired similarities to the treatment group. The second assumption states that, at least as far as the unobservables are concerned, the two comparison groups defined by the participation decision are equal. This means that participation decisions are being based on observables alone and the matching assumption (31) follows.

Under this formulation, matching is always preferable to random sampling if it increases $\mathrm{Prob}(d^D = 1)$ among comparisons and/or if it brings $E(U^C_1)$ and $E(U^C_0)$ closer in the support, $S^*$. Any of these conditions causes the comparison group to become more similar to the treatment group in the sense that at least a part of the difference is being controlled for by the observables. This is the advantage of applying matching under such circumstances.
(c) The Matching Estimator: The Use of the Propensity Score

It is clear that when a wide range of variables $X$ is in use, matching can be very difficult due to the high dimensionality of the problem. A more feasible alternative is to match on a function of $X$. Usually, this is carried out on the propensity to participate, given the set of characteristics, $X$: $P(X) = \mathrm{Prob}(d = 1 \mid X)$, which is the propensity score. Its use is usually motivated by Rosenbaum and Rubin's (1983 and 1984) result. It is shown that, under the (matching) assumptions

$$
(Y^T, Y^C) \perp d \mid X \quad \text{and} \quad 0 < \mathrm{Prob}(d = 1 \mid X) < 1,
\tag{36}
$$

the conditional independence remains valid if controlling for $P(X)$ instead of $X$:

$$
(Y^T, Y^C) \perp d \mid P(X).
\tag{37}
$$

More recently, a study by Hahn (1998) shows that the propensity score is ancillary for the estimation of the average effect of treatment on the population. However, it is also shown that knowledge of the propensity score may improve the efficiency of the estimates of the average effect of treatment on the treated. Its value for the estimation of this latter parameter lies in the dimension reduction feature.

When using the propensity score, the comparison group for each treated individual is chosen with a pre-defined criterion (established by a pre-defined measure) of proximity. Having defined the neighbourhood for each treated observation, the next issue is that of choosing the appropriate weights to associate the selected set of non-treated observations with each participant one. Several possibilities are commonly used, from a unity weight to the nearest observation and zero to the others, to equal weights to all, or kernel weights, which account for the relative proximity of the non-participants' observations to the treated ones in terms of $P(X)$.

In general, the form of the matching estimator is given by

$$
\hat{\alpha}_{MM} = \sum_{i \in T}\Big(Y_i - \sum_{j \in C} W_{ij} Y_j\Big) w_i,
\tag{38}
$$

where $W_{ij}$ is the weight placed on comparison observation $j$ for individual $i$ and $w_i$ accounts for the reweighting that reconstructs the outcome distribution for the treated sample. For example, in the nearest neighbour matching case, the estimator becomes

$$
\hat{\alpha}_{MM} = \sum_{i \in T}(Y_i - Y_j)\,\frac{1}{N_T},
\tag{39}
$$

where $j$ is the nearest neighbour in terms of $P(X)$ in the comparison group to $i$ in the treatment group. In general, kernel weights are used for $W_{ij}$ to account for the closeness of $Y_j$ to $Y_i$.

(d) The Matching Estimator: Parametric Approach

Specific functional forms assumed for the $g$-functions in (29) can be used to estimate the impact of treatment on the treated over the whole support of $X$, reflecting the trade-off between the structure one is willing to impose in the model and the amount of information that can be extracted from the data.

To estimate the impact of treatment under a parametric set-up, one needs to estimate the relationship between the observables and the outcome for the treatment and comparison groups and predict the respective outcomes for the population of interest. A comparison between the two sets of predictions supplies an estimate of the impact of the programme. In this case, one can easily guarantee that outcomes being compared come from populations sharing exactly the same characteristics.

When a linear specification is assumed with common coefficients for treatments and controls, so that

$$
\begin{aligned}
Y^T &= X\beta + \alpha_T d + U^T \\
Y^C &= X\beta + U^C,
\end{aligned}
\tag{40}
$$

not even the common support requirement is needed to estimate the impact of treatment on the treated: a simple OLS regression using all information on the treated and non-treated will consistently identify $\alpha_T$.

(e) The Matching Estimator: Drawbacks

It is likely that matching does not succeed in finding a non-treated observation with similar propensity score for all the participants. That is, for some observations, we might be unable to find the right counterfactual, which means that the common support is just a subset of the complete treated support. If the impact of treatment is homogeneous, at least within the treatment group, no additional problems appear besides the loss of information. Note, however, that the setting is general enough to include the heterogeneous case.
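The loss of participants outside the common support can be illustrated with a nearest-neighbour match on the propensity score, in the spirit of (39). The sketch below makes several assumptions that are not in the text: the propensity scores are taken as already estimated, matching is done with replacement, and a hypothetical `caliper` parameter sets the maximum distance at which a comparison observation counts as a valid counterfactual. The mean is taken over matched participants only.

```python
def nearest_neighbour_match(treated, comparison, caliper):
    """Nearest-neighbour matching on the propensity score, with replacement.

    `treated` and `comparison` are lists of (propensity_score, outcome) pairs.
    A participant whose nearest comparison lies further away than `caliper`
    is treated as outside the common support and dropped.
    Returns (mean gain among matched participants, number left unmatched).
    """
    gains = []
    unmatched = 0
    for p_i, y_i in treated:
        # Closest comparison observation in terms of P(X).
        p_j, y_j = min(comparison, key=lambda obs: abs(obs[0] - p_i))
        if abs(p_j - p_i) <= caliper:
            gains.append(y_i - y_j)
        else:
            unmatched += 1
    if not gains:
        raise ValueError("no participant could be matched within the caliper")
    return sum(gains) / len(gains), unmatched


# Hypothetical scores and outcomes.
treated = [(0.30, 12.0), (0.50, 15.0), (0.90, 30.0)]
comparison = [(0.28, 10.0), (0.52, 12.0), (0.60, 11.0)]
estimate, n_unmatched = nearest_neighbour_match(treated, comparison, caliper=0.05)
```

In this example the participant with a score of 0.90 has no comparison observation nearby and is dropped, so the estimate of 2.5 describes only the two matched participants. If the dropped participant's gain differs from the others, the estimate no longer represents the treated group as a whole.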
If the impact of training is heterogeneous within the treatment group itself and the counterfactual is more difficult to obtain for some subgroup(s) of the participants, it may be impossible to identify $\alpha_T$. In other words, if the matching process leads to a considerable loss of observations, the estimator is limited by the loss of information and is only consistent for the common support. In the heterogeneous response case, if the expected impact of participation differs across the treated, it is possible that the estimated impact does not represent the mean outcome of the programme.

Another potential problem with matching is the (heavy) requirements on data. To guarantee that assumption (31) is verified, it is important to obtain the relevant information to distinguish potential participants from others, which is not always easy. On the other hand, the more detailed the information is, the harder it is to find a similar control and the more restricted the common support becomes. That is, the correct balance between the quantity of information to use and the share of the support covered may be difficult to achieve.

(f) A Bias Decomposition

The bias term can be decomposed into three distinct parts:

$$
\text{Bias} = E(Y^C \mid X, d = 1) - E(Y^C \mid X, d = 0) = B_1 + B_2 + B_3,
\tag{41}
$$

where $B_1$ represents the bias component due to non-overlapping support of $X$, $B_2$ is the error part due to misweighting on the common support of $X$, as the resulting empirical distributions of treated and non-treated are not the same even when restricted to the same support, and $B_3$ is the true econometric selection bias resulting from selection on unobservables. Through the process of choosing and reweighting observations, matching corrects for the first two sources of bias, and the third term is assumed to be zero.

(g) Matching and Diff-in-Diffs

The assumption of conditional independence between the error term in the outcome equation and the training status (depicted by (31)) is quite strong if it is possible that individuals decide according to their forecast outcome. However, if matching is combined with diff-in-diffs, there is scope for an unobserved determinant of participation as long as it can be represented by separable individual- and/or time-specific components of the error term.
To clarify the exposition, let us now assume the following model specification:

$$
\begin{aligned}
Y^T_{it} &= g^T_t + \phi_i + \theta^T_t + \mu^T_{it} \\
Y^C_{it} &= g^C_t + \phi_i + \theta^C_t + \mu^C_{it},
\end{aligned}
\tag{42}
$$
which differs from (29) by the composition assumed for the error term and by explicitly acknowledging that the function $g$ may change over time.$^7$ If performing matching on the set of observables $X$ within this setting, the conditional independence assumption (31) can now be replaced by

$$
Y^C_{t_1} - Y^C_{t_0} \perp d \mid X,
\tag{43}
$$

where $t_0$ and $t_1$ stand for the before- and after-programme time periods. Given (42), assumption (43) is equivalent to

$$
(g^C_{t_1} - g^C_{t_0}) + (\theta^C_{t_1} - \theta^C_{t_0}) \perp d \mid X.
\tag{44}
$$

$^7$ Of course, this latter point is only important when comparing different periods, as done within the diff-in-diffs methodology.

The main matching hypothesis is now stated in terms of the before-after evolution instead of levels. If both terms of the sum in (44) are conditionally independent of the participation decision, then (44) is verified. It means that controls have evolved from a pre- to a post-programme period in the same way treatments would have done had they not been treated. This happens both on the observable component of the model and on the unobservable time trend.

The effect of the treatment on the treated can now be estimated over the common support of $X$, $S^*$, using an extension to (38):

$$
\hat{\alpha}^{LD}_{MMDID} = \sum_{i \in T}\Big[(Y_{it_1} - Y_{it_0}) - \sum_{j \in C} W_{ij}(Y_{jt_1} - Y_{jt_0})\Big] w_i,
\tag{45}
$$

where LD denotes longitudinal data and MMDID denotes method of matching with difference-in-differences. Quite obviously, this estimator requires longitudinal data to be applied. It is, however, possible to extend it for the repeated cross-sections data case. If only repeated cross-sections are available, one must perform matching three times for each treated individual after being treated: to find the comparable treated before the programme and the controls before and after the programme. If the same assumptions apply, one can estimate the effect of treatment on the treated using the following estimator:

$$
\hat{\alpha}^{RCS}_{MMDID} = \sum_{i \in T_1}\Big[\Big(Y_{it_1} - \sum_{j \in T_0} W^T_{ijt_0} Y_{jt_0}\Big) - \Big(\sum_{j \in C_1} W^C_{ijt_1} Y_{jt_1} - \sum_{j \in C_0} W^C_{ijt_0} Y_{jt_0}\Big)\Big] w_i,
\tag{46}
$$
where RCS denotes repeated cross-section, $T_0$, $T_1$, $C_0$ and $C_1$ stand for the treatment and control groups, before and after the programme, respectively, and $W^G_{ijt}$ represent the weights attributed to individual $j$ in group $G$ (where $G = C$ or $T$) and at time $t$ when comparing with treated individual $i$.$^8$

V. SOME EMPIRICAL STUDIES

In this section, we draw on two studies, one from the UK and one from the US, to illustrate some of the non-experimental techniques presented in this review. In the influential study by LaLonde (1986), it was concluded that none of the econometric methodologies used estimates the treatment impact accurately when only non-experimental data are available.$^9$ However, there are two potential issues with the LaLonde study (see Heckman, Ichimura and Todd (1997)). The first concerns the questionnaires: controls and comparisons answered different questions, based on different definitions. Second, the comparison group was not guaranteed to operate in the same labour market as the treatments, and hence different macro effects may influence each group's behaviour.

The studies presented below illustrate that the methods we have described for non-experimental data can provide good evaluation information if carefully handled. Both illustrations concern labour market programmes, the first taking place in the UK and the second in the US.

1. Diff-in-Diffs and Differential Trends: The New Deal Evaluation in the UK

The New Deal for Young People is a recent initiative of the UK government to help young unemployed people make their way into or back to work. The programme is targeted at the 19- to 24-year-old long-term unemployed. Participation is compulsory, so that every eligible individual is due to participate under the threat of losing entitlement to benefits. The criteria for eligibility are simple: every individual aged 19-24 by the time of completion of the sixth month on jobseeker's allowance (JSA) is immediately assigned to the programme and starts receiving treatment.
Given the stated rules, the programme can be classified as one of global implementation, being administered to literally everyone in the UK meeting the eligibility criteria. Indirect effects are therefore expected. The nature of these effects will be discussed below.

$^8$ For a more detailed discussion with an application of the combined matching and diff-in-diffs estimator, see Blundell, Costa Dias, Meghir and Van Reenen (2000).

$^9$ Tables 5 and 6 of LaLonde (1986) reveal that better estimates are attained when using two-step estimators.

Treatment is composed of three steps. On assignment to the programme, the individual starts an intensive job-search assistance period, called the Gateway, which lasts for up to four months. The second stage is composed of a six-month spell in subsidised employment or up to 12 months in full-time education or training. The former involves a payment of a subsidy to the employer while the employee receives the offered wage. For the latter, the individual receives an amount equivalent to the JSA payment and may be eligible for special grants in order to cover exceptional expenses. Once the option period is over, individuals who remain unemployed enrol in a new period of intense job search, the Follow-Through, which takes up to 13 weeks.

The programme was launched in the whole UK by April 1998. There was, however, a previous three-month experimental period (January 1998 to March 1998) when the programme was tried in 12 regions, called Pathfinders. The goal was to perform a three-month experiment with the Pathfinders, having as counterfactual the rest of the UK or some regions that would match the Pathfinders more closely. Clearly, identification of the treatment effect under these conditions requires stronger assumptions than when the experiment is run within regions using random assignment. As will be discussed, the problem relates to the fact that the counterfactual must be drawn either from a different labour market or from a group with different characteristics operating in the same labour market. Different types of hypotheses will be studied below.

The analysis that follows is based on the study by Blundell, Costa Dias, Meghir and Van Reenen (2000). It uses the publicly available 5 per cent sample of the whole population claiming JSA in the UK since 1982 (JUVOS). This database includes a small set of demographic variables and the start and exit dates from the claimant count, making it possible to reconstruct the unemployment history of the individuals.
The outflow from the claimant count is the outcome of interest, the choice having been determined by the availability and quality of the data used (outflows by destination are also covered in Blundell, Costa Dias, Meghir and Van Reenen (2000), but since the necessary information is only available since late 1996, we have chosen to focus here on the outflows to all destinations taken together). Also, since the programme is very recent, it is still not possible to make a long-run analysis of the effect of participation. Given this, we will use two measures in trying to evaluate the effect of the programme: outflows from the claimant count within, respectively, two and four months of completion of the sixth month on unemployment subsidy.

The rest of this section goes as follows. We start by briefly discussing the nature of the experiment. The second subsection addresses the problem of choosing and assessing a control group. We then present the estimates of the effect of the programme, and finally we discuss these results and their potential problems, mainly related to the nature of the programme.

(a) The Experimental Period

We will present a detailed analysis of the experimental period of the New Deal for Young People. The experiment was undertaken during the first three months of 1998 in a selected set of 12 regions in the UK. Every individual attending the local employment offices and meeting the eligibility criteria was assigned to the programme and started receiving treatment. Outside the Pathfinder areas, however, the New Deal was only released three months later, by April 1998.

To clarify things, it must be recognised that what has been done is not a true experiment. The main reason relates to the lack of random assignment. The regions were chosen and the programme was globally implemented in the selected places. Also, the information collected within the programme only included participants. We do not make use of data collected by the programme administration in non-Pathfinder regions. This latter issue, however, raises no relevant problem for the analysis being performed since we are using a truly random sample of all individuals claiming JSA, and for all of them the same type of information is available.

(b) Defining and Assessing the Potential Comparison Groups

The analysis will be performed based on three possible comparison groups. They are defined as follows. Comparison Group 1 is composed of individuals living in non-Pathfinder areas, aged 19-24 and completing their sixth month on JSA during the first quarter of 1998. To construct Comparison Group 2, we have used information on the labour market to determine which regions are closest to the Pathfinder areas in the following sense. The variable selected to choose the regions was the time taken to leave unemployment by agents aged 19-24. The procedure used monthly data on the median number of days claiming JSA, by region, and for each Pathfinder area selected the two non-Pathfinder regions that best reproduced its time-series pattern before the programme took place. Systematic differences in levels were not the main concern, since they can be controlled for using diff-in-diffs methodologies. Instead, the variability in the difference between the two curves was minimised, attempting to make the difference as constant over time as possible.
Thus Comparison Group 2 comprises the subset of non-Pathfinder local labour markets with a time pattern that most closely resembles the ones observed for Pathfinder areas. Comparison Group 3 is taken to be the set of individuals living in Pathfinder areas, aged 25 to 30 and completing their sixth month on JSA during the first quarter of 1998. The treatment group is, of course, composed of individuals living in Pathfinder areas, aged 19-24 and completing their sixth month on JSA during the first quarter of 1998.

In what follows, we will compare the characteristics of the different groups before the programme is released. We begin by analysing the time to leave the claimant count, the variable chosen to select the regions used in Comparison Group 2. This variable is not exactly what will be used as a measure of the outcome, since it includes everybody entering unemployment, not just those remaining unemployed for more than six months. However, it may provide a good characterisation of the labour market.

Figures 1 to 3 illustrate the performance of the different comparison groups against the performance of the treatment group in terms of the time to leave unemployment. Figure 1 includes Comparison Groups 1 and 3 along with the treatment group. The younger groups take less time to leave the claimant count, and the Pathfinder areas seem to behave historically worse than the rest of the UK as the unemployed there take longer to leave the claimant count. However, since constant differences do not affect the estimates of the treatment effect, we are more interested in analysing the variability of the differences over time. The three curves exhibit some parallelism but maybe not as much as would be desirable. The difference between the treatment group and Comparison Group 1 curves seems to be more volatile than the difference between the curves corresponding to the treatment group and Comparison Group 3 (the variances of the differences are 3.2 and 2.6, respectively). This seems to indicate that labour markets for different age-groups in the same region are more similar than labour markets in different regions for the same age-group.

FIGURE 1
Median Number of Days Claiming JSA: Comparing 19- to 24-Year-Olds and 25- to 30-Year-Olds Living in Pathfinder and Non-Pathfinder Areas
[Time-series plot, January 1983 to January 1997. Series: 25- to 30-year-olds, Pathfinder areas; 19- to 24-year-olds, Pathfinder areas; 19- to 24-year-olds, non-Pathfinder areas.]

Figure 2 presents Comparison Groups 1 and 2 against the treatment group. The matching procedure seems to have created a better comparison group. In fact, the variance of the difference between treatment group and Comparison Group 2 is about 2.6, lower than the 3.2 found when using Comparison Group 1.

FIGURE 2
Median Number of Days Claiming JSA: Comparing 19- to 24-Year-Olds Living in Pathfinder, Non-Pathfinder and Matched Non-Pathfinder Areas
[Time-series plot, January 1983 to January 1997. Series: 19- to 24-year-olds, Pathfinder areas; 19- to 24-year-olds, non-Pathfinder areas; 19- to 24-year-olds, matched areas.]

FIGURE 3
Median Number of Days Claiming JSA: Comparing 25- to 30-Year-Olds Living in Pathfinder, Non-Pathfinder and Matched Non-Pathfinder Areas
[Time-series plot, January 1983 to January 1997. Series: 25- to 30-year-olds, Pathfinder areas; 25- to 30-year-olds, non-Pathfinder areas; 25- to 30-year-olds, matched areas.]
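The criterion used above to build and assess comparison groups, namely the variance over time of the difference between two series, is simple to compute. The following sketch assumes equal-length series of median days claiming JSA and uses the population variance; whether the study used the population or the sample variance is not stated, and the function names and the numbers in the example are hypothetical.

```python
from statistics import pvariance


def variance_of_difference(series_a, series_b):
    """Variance over time of the period-by-period gap between two series.

    A constant gap gives zero, so systematic differences in levels
    are not penalised.
    """
    if len(series_a) != len(series_b):
        raise ValueError("series must cover the same periods")
    return pvariance([a - b for a, b in zip(series_a, series_b)])


def closest_regions(target, candidates, k=2):
    """Names of the k candidate regions whose gap to `target` varies least."""
    ranked = sorted(
        candidates,
        key=lambda name: variance_of_difference(target, candidates[name]),
    )
    return ranked[:k]


# Hypothetical pre-programme series for one Pathfinder area and three candidates.
pathfinder = [100, 110, 105, 120]
candidates = {
    "A": [90, 100, 95, 110],    # constant gap of 10
    "B": [100, 112, 103, 121],  # small, variable gap
    "C": [80, 120, 90, 140],    # large, variable gap
}
matched = closest_regions(pathfinder, candidates)
```

Region A is ranked first even though its level differs from the Pathfinder area by 10 days throughout, since a constant gap is removed by differencing in the diff-in-diffs estimator.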
TABLE 4
Comparing the Characteristics of the Treatment and Comparison Groups

Marital status: proportion married
Entry quarter     1990:I    1991:I    1992:I    1993:I    1994:I    1995:I    1996:I    1997:I
Treatment group   0.040     0.036     0.052     0.036     0.040     0.038     0.027     0.028
Comp. Group 1     0.043     0.045*    0.046     0.040     0.036     0.031     0.028     0.026
Comp. Group 2     0.039     0.041     0.043     0.040     0.036     0.027*    0.027     0.026
Comp. Group 3     0.353**   0.318**   0.359**   0.293**   0.290**   0.242**   0.239**   0.204**

Unemployed less than six months over the last two years
Entry quarter     1990:I    1991:I    1992:I    1993:I    1994:I    1995:I    1996:I    1997:I
Treatment group   0.749     0.781     0.865     0.727     0.688     0.645     0.663     0.692
Comp. Group 1     0.762     0.786     0.883**   0.737     0.683     0.651     0.673     0.674
Comp. Group 2     0.773*    0.812**   0.896**   0.745     0.708     0.661     0.685     0.684
Comp. Group 3     0.520**   0.621**   0.853     0.586**   0.523**   0.450**   0.432**   0.444**

Unemployed less than 12 months over the last two years
Entry quarter     1990:I    1991:I    1992:I    1993:I    1994:I    1995:I    1996:I    1997:I
Treatment group   0.885     0.901     1.000     0.864     0.832     0.803     0.831     0.827
Comp. Group 1     0.899     0.911     0.999     0.883**   0.833     0.801     0.810**   0.822
Comp. Group 2     0.908**   0.928**   1.000     0.887**   0.842     0.807     0.813     0.830
Comp. Group 3     0.732**   0.803**   1.000     0.768**   0.704**   0.652**   0.650**   0.619**

No unemployment spells within the last two years
Entry quarter     1990:I    1991:I    1992:I    1993:I    1994:I    1995:I    1996:I    1997:I
Treatment group   0.350     0.367     0.537     0.442     0.464     0.438     0.425     0.445
Comp. Group 1     0.361     0.398**   0.528     0.441     0.437**   0.418     0.419     0.418*
Comp. Group 2     0.361     0.409**   0.549     0.445     0.457     0.417     0.431     0.430
Comp. Group 3     0.220**   0.300**   0.483**   0.332**   0.254**   0.235**   0.219**   0.212**

Number of observations
Entry quarter     1990:I    1991:I    1992:I    1993:I    1994:I    1995:I    1996:I    1997:I
Treatment group   1,727     1,762     1,815     1,752     1,628     1,623     1,512     1,424
Comp. Group 1     11,102    11,869    11,951    12,029    11,014    11,585    9,721     8,402
Comp. Group 2     2,349     2,631     2,834     2,875     2,709     2,555     2,401     2,054
Comp. Group 3     781       881       1,036     1,140     1,089     1,028     1,013     949

Key:
Treatment group: men aged 19–24 living in Pathfinder areas
Comp. Group 1: men aged 19–24 living in non-Pathfinder areas
Comp. Group 2: men aged 19–24 living in matched non-Pathfinder areas
Comp. Group 3: men aged 25–30 living in Pathfinder areas
*Estimates for treatment and respective comparison group are statistically different at 10 per cent.
**Estimates for treatment and respective comparison group are statistically different at 5 per cent.
Figure 3 presents the same kind of comparisons for the 25- to 30-year-olds, using the regions matched for the younger group. Similar comments apply.

Table 4 compares the treatment group with the three selected comparison groups in a range of other characteristics before the programme was launched. In general, there are no significant differences between the treatment group and Comparison Groups 1 and 2. Comparison Group 3, however, exhibits a different pattern in literally all the presented dimensions. As mentioned above, differences that are constant over time do not affect the consistency of the diff-in-diffs estimates, and these differences certainly show some systematic pattern.

(c) The Effect of the Programme

To assess the effect of the treatment, we have chosen to use two possible outcome variables: the outflow from the claimant count within two months of completing the sixth month on unemployment subsidy (the start of the treatment) and the outflow from the claimant count within four months of the start of the treatment. Table 5 presents some of the estimated effects for both measures when comparing the treatment group with the three comparison groups considered.

The first column of estimates presented in Table 5 uses a single differences method. It is assumed that the instrumental variables used to define the comparison groups (either the living area or the age) are correlated with the treatment indicator but uncorrelated with the outcome. The results obtained are as follows. Using Comparison Group 1, we estimate that the probability of leaving the claimant count within two months of completion of the sixth month on unemployment benefit is 4.4 per cent higher for treated individuals. This estimate is not significant; after four months, however, the estimated effect rises to almost 12 per cent and achieves statistical significance. If these are the true parameters, it means that the New Deal is effectively helping people out of unemployment quite significantly. We have reproduced these estimates under weaker assumptions.
The second estimator in the table is the diff-in-diffs, using the first quarter of 1997 as the before-programme period. This procedure assumes that treatments and comparisons are equally affected by the same macro shocks, but they are allowed to have group-specific characteristics that are constant over time. Given that the comparison groups are drawn either from different regions or from a different age-group, this assumption may be rather strong. The results obtained are significantly higher than the ones obtained with the single differences procedure: the estimated effects increase by over 3 percentage points for both outcome variables using Comparison Group 1.

We also considered the possibility of group-specific time effects. This suggests the use of a trend-adjusted diff-in-diffs estimator. This method does indeed allow for distinct time trends across groups but requires the group-specific macro shocks to exhibit cyclical behaviour, repeating themselves over the cycles. Under this assumption, the best choice for the comparison period is the comparable part of the previous cycle. We have used the 1989:I–1990:I period. The estimates remain at similar levels for Comparison Group 1, being pushed down by around 1 percentage point, but the effect after two months loses statistical significance.

TABLE 5
Treatment Effects for People Joining the Programme during the First Quarter of 1998

Columns: (1) Single differences; (2) Diff-in-diffs; (3) Trend-adjusted diff-in-diffs; (4) Linear matching diff-in-diffs; (5) Linear matching trend-adjusted diff-in-diffs.

Comparison Group 1: 19- to 24-year-olds living in non-Pathfinder areas
                                        (1)        (2)        (3)        (4)        (5)
No. of observations                     1,627      3,716      8,556      3,716      8,556
Effect after two months of treatment    0.044      0.082**    0.072      0.076*     0.062
                                        (0.031)    (0.041)    (0.056)    (0.041)    (0.056)
Effect after four months of treatment   0.119**    0.152**    0.144**    0.147**    0.135**
                                        (0.033)    (0.044)    (0.061)    (0.044)    (0.061)

Comparison Group 2: 19- to 24-year-olds living in matched non-Pathfinder areas
                                        (1)        (2)        (3)        (4)        (5)
No. of observations                     683        1,590      3,350      1,590      3,350
Effect after two months of treatment    0.011      0.109**    0.073      0.191**    0.060
                                        (0.036)    (0.049)    (0.066)    (0.052)    (0.066)
Effect after four months of treatment   0.070*     0.098**    0.180**    0.173**    0.164**
                                        (0.039)    (0.049)    (0.071)    (0.052)    (0.072)

Comparison Group 3: 25- to 30-year-olds living in Pathfinder areas
                                        (1)        (2)        (3)        (4)        (5)
No. of observations                     469        1,096      2,137      1,096      2,137
Effect after two months of treatment    0.060      0.031      0.031      0.031      0.022
                                        (0.042)    (0.055)    (0.079)    (0.056)    (0.080)
Effect after four months of treatment   0.154**    0.144**    0.117      0.137**    0.113
                                        (0.046)    (0.061)    (0.089)    (0.062)    (0.089)

*Significant at 10 per cent level.
**Significant at 5 per cent level.
Notes: Standard errors are given in parentheses below the estimate. Trend-adjusted estimates used the 1989:I–1990:I period.

Finally, a linear matching procedure has been applied. It guarantees that groups with similar observable characteristics are being compared. We have combined it with the diff-in-diffs and with the trend-adjusted diff-in-diffs methods. The necessary assumptions on the group-specific effects are being relaxed to the hypothesis that their temporary part is independent of participation, given that we control for a set of observables. However, the use of linear matching along with diff-in-diffs and trend-adjusted diff-in-diffs changes the results for Comparison Group 1 only marginally.

When using Comparison Group 2 (a subset of Comparison Group 1 using the most similar regions), the estimates increase, in general, by between 2 and 4 percentage points. This does not happen, however, with the single differences, where the estimates actually fall.

Constructing the counterfactual from the older group (Comparison Group 3) weakens the results, especially when considering the effect of two months of treatment: these estimates are generally lower when using this comparison group and none of them is significant at conventional levels. The effect after four months of treatment is occasionally estimated with less precision but at levels very similar to the ones obtained when using Comparison Group 1. Given the size of the sample for these comparisons, some loss of statistical significance is to be expected.

Overall, the estimates give the same indication, independently of the chosen comparison group or estimation technique: there is a positive and significant impact of the programme in taking people out of the benefit account. However, these results are not free from criticism. There are a number of reasons why they may not be robust.
It could be that the programme itself pushes participants out of the claimant count by placing them in options that they are expected to accept. There may be self-selection on unobservables that are not controlled for by the matching and differencing methods. A third potential criticism of the results relates to substitution. Suppose that the labour supplied by participants if at work is substitutable for the labour supplied by workers similar to but older than the ones we are comparing participants to. If participants are being made more effective at job-searching and are being offered subsidised jobs, it is likely that they will take some of the jobs that would have been taken by their older counterparts. However, without very strong assumptions, it is generally not possible to distinguish the substitution effects from macro shocks. Finally, the global nature of the programme may also give rise to wage effects, especially if the target group is relatively large. These issues are discussed more fully in Blundell, Costa Dias, Meghir and Van Reenen (2000).
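The single differences, diff-in-diffs and trend-adjusted diff-in-diffs estimators compared in Table 5 can be sketched as simple contrasts of group means. The code below is a minimal illustration, not the estimation code behind the table: the function names and the 0/1 outcome lists (1 meaning the person left the claimant count within the window) are hypothetical, and no matching or standard errors are included. The trend adjustment is assumed to subtract the differential change observed over a comparable interval of the previous cycle.

```python
from statistics import mean


def single_difference(t_post, c_post):
    """Post-programme gap in mean outcomes between treated and comparisons."""
    return mean(t_post) - mean(c_post)


def diff_in_diffs(t_pre, t_post, c_pre, c_post):
    """Change for the treated minus change for the comparisons."""
    return (mean(t_post) - mean(t_pre)) - (mean(c_post) - mean(c_pre))


def trend_adjusted_did(t_pre, t_post, c_pre, c_post,
                       t_ref_start, t_ref_end, c_ref_start, c_ref_end):
    """Diff-in-diffs net of the differential trend seen in a reference
    interval (for example, the comparable part of the previous cycle)."""
    differential_trend = diff_in_diffs(t_ref_start, t_ref_end,
                                       c_ref_start, c_ref_end)
    return diff_in_diffs(t_pre, t_post, c_pre, c_post) - differential_trend


# Hypothetical exit indicators for each group and period.
t_pre, t_post = [1, 0, 0, 0], [1, 1, 1, 0]
c_pre, c_post = [1, 1, 0, 0], [1, 1, 1, 0]
t_ref_start, t_ref_end = [1, 0, 0, 0], [1, 1, 0, 0]
c_ref_start, c_ref_end = [1, 1, 0, 0], [1, 1, 0, 0]

print(single_difference(t_post, c_post))            # 0.0
print(diff_in_diffs(t_pre, t_post, c_pre, c_post))  # 0.25
print(trend_adjusted_did(t_pre, t_post, c_pre, c_post,
                         t_ref_start, t_ref_end,
                         c_ref_start, c_ref_end))   # 0.0
```

In this made-up example the three estimators disagree because the treated group starts from a lower level and was already catching up in the reference interval, which is the kind of group-specific trend the adjustment is meant to remove.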
2. The Method of Matching: The JTPA Evaluation in the US

A recent study by Heckman, Ichimura and Todd (1997) evaluates matching methods under different assumptions on the richness of available data. Information gathered under the Job Training Partnership Act (JTPA) was used to compare the performance of matching models with experimental procedures. The JTPA is the main US government training programme for disadvantaged workers. It provides on-the-job training, job-search assistance and classroom training to youths and adults. Eligibility is determined by family income being near or below the poverty level for six months prior to application or by participation in federal, state or local welfare and food stamp programmes. Detailed longitudinal data were collected under an experimental setting for a group of treatments and randomised-out controls, as well as for a potential comparison group of eligible non-participants (see Devine and Heckman (1996), Kemple, Dolittle and Wallace (1993) and Orr et al. (1994)). All the groups were resident in the same narrowly defined geographic regions and were administered the same questionnaire. The richness of information also allowed the construction of close comparison groups from other surveys. As in the LaLonde (1986) study, earnings are the outcome measure. With such data, a formal analysis of estimated bias was possible and thus the relative advantages of matching were clearly stated.

Let us start by focusing on the results concerned with the comparability of supports. Heckman, Ichimura and Todd (1997) draw the densities of $P(X)$ (probability of programme participation) for controls and eligible non-participants. It is clear from this study that the common support defined by the propensity to participate is very restricted. This means that the potential non-experimental comparison group, composed of the eligible non-participants, does not reproduce the characteristics of the treated as represented by the experimental comparison group, composed of the controls.
Therefore a significant source of bias when dealing with non-experimental data should come from not controlling for non-overlapping support. It is also clear that if the common support is a relatively small subset of the whole support for the treatment group, then the entire group is unlikely to be represented. Of course, the fact that non-experimental evaluations use only a small part of the treated support in trying to avoid the non-overlapping support type of bias implies that the parameter being estimated is not the same as when an experiment dataset is available.

An empirical decomposition of the evaluation bias as measured by the average monthly earnings is presented in Table 6 (see also Table 2 in Heckman, Ichimura and Todd (1997)). As already mentioned, the total evaluation bias is given by $B = E(Y^C \mid X, d=1) - E(Y^C \mid X, d=0)$. Recall that the total bias may be decomposed into three parts: the bias due to non-overlapping support of $X$ ($B_1$), the bias due to misweighting on the common support of $X$ ($B_2$) and the bias resulting from selection on unobservables ($B_3$). In this study, the first term is estimated with the controls' reported earnings, while for the second term three options are used: eligible non-participants, a group based on the Survey of Income and Program Participation (SIPP) and a group of no-shows which include controls and persons assigned to treatment that dropped out before receiving any service. The estimated biases result from a simple difference estimator of treatment impact.

TABLE 6
Bias Decomposition of Simple Difference in Post-Programme Mean Earnings Estimator

Columns: (1) Mean difference, $\hat{B}$; (2) Non-overlap, $\hat{B}_1$; (3) Density weighting, $\hat{B}_2$; (4) Selection bias, $\hat{B}_3$; (5) Average bias, $\hat{B}_{\text{common}}$; (6) $\hat{B}_{\text{common}}$ as percentage of treatment impact.

Experimental controls and eligible non-participants
                 (1)     (2)     (3)     (4)     (5)     (6)
Adult males      342     218     584     23      38      87%
Adult females    33      80      78      31      38      129%
Male youth       20      142     131     9       14      23%
Female youth     42      74      67      35      49      7,239%

Experimental controls and SIPP eligibles
                 (1)     (2)     (3)     (4)     (5)     (6)
Adult males      145     151     417     121     192     440%
Adult females    47      97      172     122     198     676%
Male youth       188     65      263     9       21      36%
Female youth     88      83      168     3       13      1,969%

Experimental controls and no-shows
                 (1)     (2)     (3)     (4)     (5)     (6)
Adult males      29      13      3       38      42      97%
Adult females    9       1       9       18      20      68%
Male youth       84      14      21      91      99      171%
Female youth     18      3       31      46      51      7,441%

It is clear that types 1 and 2 bias account for the majority of the error in any case. Nonetheless, selection bias as correctly defined is a significant error when compared with the treatment impact and is even greater when evaluating the bias on the common support.

Another relevant point concerns the usage of different datasets to construct the comparison group. The SIPP data panel includes information detailed enough to evaluate eligibility, but the precise location of respondents is unknown and the survey questions are not exactly the same. As a result, selection bias for estimates using this information is typically higher in both absolute and relative terms.

The results obtained when using no-shows as a comparison group are quite interesting. Those people are likely to be very similar to the treated. In fact, if non-enrolment were random with respect to outcomes, they would be just like the experimental group. Most probably, this is not the case, but the same matching methods as the ones used with eligible non-participants can be applied here to control for the differences. The third panel of Table 6 shows that the bias is substantially lower when using this group instead of eligible non-participants (except for male youth) but it is more heavily weighted toward the selection bias component, $B_3$.

The comparison between diff-in-diffs and single difference matching estimators using the group of eligible non-participants is reported in Table 7. The outcome measures are the quarterly earnings for quarters 1 to 6 after treatment. The values presented are estimates of the selection bias on common support, $B_{S_c}$, from four different matching estimators: respectively, simple and regression-adjusted single differences and difference-in-differences. The regression-adjusted estimator lies between fully non-parametric and parametric approaches and is likely to improve the results when compared with completely non-parametric estimators.
It is based on a particular specification for the no-treatment outcomes, linear say: $Y^C = X\beta + U^C$. To estimate the treatment impact, we should first estimate $\beta$ and then remove $X\hat{\beta}$ from each $Y^T$ and $Y^C$ observation. With such values, we perform matching on $X$ or $P(X)$ as desired and estimate the impact by a simple mean difference. When using the diff-in-diffs estimator, the removal operation is required for each pre- and post-treatment observation.

The estimates in Table 7 are based on kernel weights. Specifically, each treatment observation is matched with a weighted average of the outcomes for all individuals in the comparison group. Local linear weights are used because they enable a faster convergence rate at boundary points and adapt better to different data densities (for more details, see Heckman, Ichimura and Todd (1997) and Fan (1992)).¹⁰

TABLE 7
Estimated Bias for Alternative Matching Methods: Experimental Controls and Eligible Non-Participants

Columns: (1) Local linear matching; (2) Regression-adjusted local linear matching; (3) Diff-in-diffs local linear matching; (4) Diff-in-diffs regression-adjusted local linear matching.

Quarter                  (1)     (2)     (3)     (4)

Adult males
t=1                      33      39      97      104
t=2                      37      39      77      77
t=3                      29      21      90      74
t=4                      80      65      112     98
t=5                      64      50      19      5
t=6                      37      17      4       35
Average, 1–6             47      38      67      52
% of adjusted impact     77%     62%     109%    85%

Adult females
t=1                      45      55      65      74
t=2                      48      55      53      60
t=3                      26      31      10      14
t=4                      36      35      12      7
t=5                      48      48      29      23
t=6                      23      16      5       18
Average, 1–6             38      40      27      27
% of adjusted impact     109%    114%    78%     76%

Male youth
t=1                      3       8       43      80
t=2                      40      28      43      61
t=3                      33      8       92      70
t=4                      44      4       9       5
t=5                      84      42      18      11
t=6                      28      31      23      64
Average, 1–6             39      7       30      22
% of adjusted impact     108%    19%     84%     61%

Female youth
t=1                      31      8       7       14
t=2                      79      27      60      27
t=3                      121     49      135     83
t=4                      37      28      45      4
t=5                      65      8       45      7
t=6                      34      1       31      6
Average, 1–6             61      8       52      17
% of adjusted impact     248%    33%     209%    67%

The first two columns of Table 7 present the results using a simple difference matching estimator and the last two contain diff-in-diffs results. The last row of each panel shows the bias as a proportion of the estimated experimental impact on the common support of treatments and eligible non-participants. As predicted, the combination of non-parametric and parametric techniques performs better than fully non-parametric approaches. The diff-in-diffs estimator does better for some groups, but not all.¹¹ For all estimators presented, there is considerable variation in the estimated bias over time. In spite of the considerable improvements relative to simpler estimates, the bias remains overly strong as a percentage of the adjusted impact of treatment. There is still considerable selection on unobservables that contaminates the non-experimental estimates.

VI. CONCLUSIONS

This paper has presented an overview of alternative evaluation methods, focusing on approaches that do not require experimental data. We have assessed a number of approaches, including the use of selection, difference-in-differences and propensity score matching. Drawing on studies from the UK and the US, we have reviewed the performance of alternative methods. The appropriate choice of evaluation method has been shown to depend on a combination of the data available and the policy parameter of interest. Where non-experimental data are all that is available, a careful combination of matching and differencing can provide useful insights into the impact of some policy interventions. For example, in the study of training programmes, it has been found that, where data on local labour market characteristics and previous work experience are collected, an approach that combines propensity score matching with the difference-in-differences technique is quite robust.
It allows matching on pre-programme shocks and, by collecting good local pre-programme labour market history data, allows the comparison group to be placed in the same labour market.

The methods presented have been discussed in a comparable framework, and the respective assumptions required to estimate the parameter of interest have been laid out systematically. We hope that, by doing so, this review can provide a useful resource in deciding on an appropriate evaluation method and understanding its properties.

¹⁰ The expression for the local linear weights is the following:

$$
W_{N_C,N_T}(i,j) = \frac{G_{ij}\sum_{k\in I_C} G_{ik}(X_k - X_i)^2 - G_{ij}(X_j - X_i)\sum_{k\in I_C} G_{ik}(X_k - X_i)}{\sum_{l\in I_C} G_{il}\sum_{k\in I_C} G_{ik}(X_k - X_i)^2 - \left\{\sum_{k\in I_C} G_{ik}(X_k - X_i)\right\}^2},
$$

where $W_{N_C,N_T}(i,j)$ is the weight for the comparison $j$ when matching with the treated $i$, and the numbers of comparisons and treatments are $N_C$ and $N_T$, respectively. $G_{ij}$ is a kernel function, $G_{ik} = G\{(X_i - X_k)/a_{N_C}\}$, and $a_{N_C}$ is the bandwidth. Finally, $I_C$ is the sample of comparisons.

¹¹ It is noteworthy that the identifying hypothesis underlying the diff-in-diffs estimator for symmetric differences around the enrolment date (independence between the post- and pre-treatment mean difference and treatment status) was the only one not being rejected in tests performed by Heckman, Ichimura and Todd (1997).

APPENDIX: THE HECKMAN SELECTION ESTIMATOR

The two-step selection estimator deals with the selection bias problem through direct control of the part of the error term that is correlated with the treatment status indicator. The procedure is as follows (see Heckman (1979)). Given the independence of $Z$ and $V$, the probability of programme participation can be estimated using discrete choice analysis. With such information for each agent, and along with knowledge of the joint distribution of the error terms, one can compute the conditional expectation of $U_t$,

$$
\begin{aligned}
E(U_t \mid d=0, Z) &= \frac{\int_{-\infty}^{+\infty}\int_{-\infty}^{F^{-1}(\mathrm{Prob}(d=0\mid Z))} t_1\, h(t_1,t_2)\, dt_2\, dt_1}{\mathrm{Prob}(d=0\mid Z)} \\
E(U_t \mid d=1, Z) &= \frac{\int_{-\infty}^{+\infty}\int_{F^{-1}(\mathrm{Prob}(d=0\mid Z))}^{+\infty} t_1\, h(t_1,t_2)\, dt_2\, dt_1}{\mathrm{Prob}(d=1\mid Z)},
\end{aligned}
\tag{A.1}
$$

where $F$ is the cumulative distribution function of $V$. This information should be incorporated in the outcome regression equation, jointly with all the other covariates, as a selection bias control. The remaining unobservable will be totally independent of treatment status under the accepted hypothesis, and therefore the estimator is consistent.

BIBLIOGRAPHY

Ashenfelter, O. (1978), 'Estimating the effect of training programs on earnings', Review of Economics and Statistics, vol. 60, pp. 47–57.
Ashenfelter, O. and Card, D. (1985), 'Using the longitudinal structure of earnings to estimate the effect of training programs', Review of Economics and Statistics, vol. 67, pp. 648–60.
Bassi, L. (1983), 'The effect of CETA on the post-program earnings of participants', Journal of Human Resources, vol. 18, pp. 539–56.
Bassi, L. (1984), 'Estimating the effects of training programs with nonrandom selection', Review of Economics and Statistics, vol. 66, pp. 36–43.
Bell, B., Blundell, R. and Van Reenen, J. (1999), 'Getting the unemployed back to work: an evaluation of the New Deal proposals', International Tax and Public Finance, vol. 6, pp. 339–60.
Blundell, R., Costa Dias, M., Meghir, C. and Van Reenen, J. (2000), 'Evaluating the employment impact of mandatory job-search assistance: the UK New Deal Gateway', unpublished manuscript, Institute for Fiscal Studies.
Blundell, R., Dearden, L. and Meghir, C. (1996), The Determinants and Effects of Work-Related Training in Britain, London: Institute for Fiscal Studies.
Blundell, R., Duncan, A. and Meghir, C. (1998), 'Estimating labour supply responses using tax policy reforms', Econometrica, vol. 66, pp. 827–61.
Blundell, R. and MaCurdy, T. (1999), 'Labor supply: a review of alternative approaches', in O. Ashenfelter and D. Card (eds), Handbook of Labor Economics, Elsevier North-Holland.
Burtless, G. (1985), 'Are targeted wage subsidies harmful? Evidence from a wage voucher experiment', Industrial and Labor Relations Review, vol. 39, pp. 105–14.
Card, D. and Robins, P. K. (1998), 'Do financial incentives encourage welfare recipients to work?', Research in Labor Economics, vol. 17, pp. 1–56.
Cochrane, W. and Rubin, D. (1973), 'Controlling bias in observational studies', Sankhya, vol. 35, pp. 417–46.
Devine, T. and Heckman, J. (1996), 'Consequences of eligibility rules for a social program: a study of the Job Training Partnership Act (JTPA)', in S. Polachek (ed.), Research in Labor Economics, vol. 15, pp. 111–70, Greenwich, CT: JAI Press.
Eissa, N. and Liebman, J. (1996), 'Labor supply response to the Earned Income Tax Credit', Quarterly Journal of Economics, vol. 111, pp. 605–37.
Fan, J. (1992), 'Design adaptive nonparametric regression', Journal of the American Statistical Association, vol. 87, pp. 998–1004.
Fisher, R. (1951), The Design of Experiments, sixth edition, London: Oliver and Boyd.
Hahn, J. (1998), 'On the role of the propensity score in efficient semiparametric estimation of average treatment effects', Econometrica, vol. 66, pp. 315–31.
Hausman, J. A. and Wise, D. A. (1985), Social Experimentation, Chicago: University of Chicago Press for National Bureau of Economic Research.
Heckman, J. (1979), 'Sample selection bias as a specification error', Econometrica, vol. 47, pp. 153–61.
Heckman, J. (1990), 'Varieties of selection bias', American Economic Review, vol. 80, pp. 313–18.
Heckman, J. (1992), 'Randomization and social program', in C. Manski and I. Garfinkel (eds), Evaluating Welfare and Training Programs, Cambridge, MA: Harvard University Press.
Heckman, J. (1996), 'Randomization as an instrumental variable estimator', Review of Economics and Statistics, vol. 56, pp. 336–41.
Heckman, J. (1997), 'Instrumental variables: a study of the implicit assumptions underlying one widely used estimator for program evaluations', Journal of Human Resources, forthcoming.
Heckman, J. and Hotz, V. J. (1989), 'Choosing among alternative nonexperimental methods for estimating the impact of social programs', Journal of the American Statistical Association, vol. 84, pp. 862–74.
Heckman, J., Ichimura, H. and Todd, P. (1997), 'Matching as an econometric evaluation estimator', Review of Economic Studies, vol. 64, pp. 605–54.
Heckman, J. and Robb, R. (1985), 'Alternative methods for evaluating the impact of interventions', in Longitudinal Analysis of Labour Market Data, New York: Wiley.
Heckman, J. and Robb, R. (1986), 'Alternative methods for solving the problem of selection bias in evaluating the impact of treatments on outcomes', in H. Wainer (ed.), Drawing Inferences from Self-Selected Samples, Berlin: Springer Verlag.
Heckman, J. and Smith, J. (1994), 'Ashenfelter's dip and the determinants of program participation', University of Chicago, mimeo.
Heckman, J., Smith, J. and Clements, N. (1997), 'Making the most out of program evaluations and social experiments: accounting for heterogeneity in program impacts', Review of Economic Studies, vol. 64, pp. 487–536.
Kemple, J., Dolittle, F. and Wallace, J. (1993), The National JTPA Study: Site Characteristics in Participation Patterns, New York: Manpower Demonstration Research Corporation.
LaLonde, R. (1986), 'Evaluating the econometric evaluations of training programs with experimental data', American Economic Review, vol. 76, pp. 604–20.
Orr, L., Bloom, H., Bell, S., Lin, W., Cave, G. and Dolittle, F. (1994), The National JTPA Study: Impacts, Benefits and Costs of Title II-A, report to the US Department of Labor, 132, Bethesda, MD: Abt Associates.
Rosenbaum, P. and Rubin, D. B. (1983), 'The central role of the propensity score in observational studies for causal effects', Biometrika, vol. 70, pp. 41–55.
Rosenbaum, P. and Rubin, D. B. (1984), 'Reducing bias in observational studies using subclassification on the propensity score', Journal of the American Statistical Association, vol. 79, pp. 516–24.
Rosenbaum, P. and Rubin, D. B. (1985), 'Constructing a control group using multivariate matched sampling methods that incorporate the propensity score', American Statistician, pp. 39–58.
Rubin, D. B. (1978), 'Bayesian inference for causal effects: the role of randomization', Annals of Statistics, vol. 7, pp. 34–58.
Rubin, D. B. (1979), 'Using multivariate matched sampling and regression adjustment to control bias in observational studies', Journal of the American Statistical Association, vol. 74, pp. 318–29.