How Much Can We Generalize from Impact Evaluations?


Eva Vivalt
New York University
April 30, 2015

Abstract

Impact evaluations aim to predict the future, but they are rooted in particular contexts and results may not generalize across settings. I founded an organization to systematically collect and synthesize impact evaluation results on a wide variety of interventions in development. These data allow me to answer this and other questions for the first time using a large data set of studies. I examine whether results predict each other and whether variance in results can be explained by program characteristics, such as who is implementing them, where they are being implemented, the scale of the program, and what methods are used. I find that when regressing an estimate on the hierarchical Bayesian meta-analysis result formed from all other studies on the same intervention-outcome combination, the result is significant with a coefficient of , though the $R^2$ is very low. The program implementer is the main source of heterogeneity in results, with government-implemented programs faring worse than and being poorly predicted by the smaller studies typically implemented by academic/NGO research teams, even controlling for sample size. I then turn to examine specification searching and publication bias, issues which could affect generalizability and are also important for research credibility. I demonstrate that these biases are quite small; nevertheless, to address them, I discuss a mathematical correction that could be applied before showing that randomized controlled trials (RCTs) are less prone to this type of bias and exploiting them as a robustness check.
I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the University of Ottawa, and the Australian National University, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto.
1 Introduction

In the last few years, impact evaluations have become extensively used in development economics research. Policymakers and donors typically fund impact evaluations precisely to figure out how effective a similar program would be in the future to guide their decisions on what course of action they should take. However, it is not yet clear how much we can extrapolate from past results or under which conditions. Further, there is some evidence that even a similar program, in a similar environment, can yield different results. For example, Bold et al. (2013) carry out an impact evaluation of a program to provide contract teachers in Kenya; this was a scaled-up version of an earlier program studied by Duflo, Dupas and Kremer (2012). The earlier intervention studied by Duflo, Dupas and Kremer was implemented by an NGO, while Bold et al. compared implementation by an NGO and the government. While Duflo, Dupas and Kremer found positive effects, Bold et al. showed significant results only for the NGO-implemented group. The different findings in the same country for purportedly similar programs point to the substantial context-dependence of impact evaluation results. Knowing this context-dependence is crucial in order to understand what we can learn from any impact evaluation.

While the main reason to examine generalizability is to aid interpretation and improve predictions, it would also help to direct research attention to where it is most needed. If generalizability were higher in some areas, fewer papers would be needed to understand how people would behave in a similar situation; conversely, if there were topics or regions where generalizability was low, it would call for further study. With more information, researchers can better calibrate where to direct their attention to generate new insights.

It is well-known that impact evaluations only happen in certain contexts.
For example, Figure 1 shows a heat map of the geocoded impact evaluations in the data used in this paper overlaid by the distribution of World Bank projects (black dots). Both sets of data are geographically clustered, and whether or not we can reasonably extrapolate from one to another depends on how much related heterogeneity there is in treatment effects. Allcott (forthcoming) recently showed that site selection bias was an issue for randomized controlled trials (RCTs) on a firm's energy conservation programs. Microfinance institutions that run RCTs and hospitals that conduct clinical trials are also selected (Allcott, forthcoming), and World Bank projects that receive an impact evaluation are different from those that do not (Vivalt, 2015). Others have sought to explain heterogeneous treatment effects in meta-analyses of specific topics (e.g. Saavedra and Garcia, 2013, among many others for conditional cash transfers), or to argue they are so heterogeneous they cannot be adequately modelled (e.g. Deaton, 2011; Pritchett and Sandefur, 2013).

Figure 1: Growth of Impact Evaluations and Location Relative to Programs. The figure on the left shows a heat map of the impact evaluations in AidGrade's database overlaid by black dots indicating where the World Bank has done projects. While there are many other development programs not done by the World Bank, this figure illustrates the great numbers and geographical dispersion of development programs. The figure on the right plots the number of studies that came out in each year that are contained in each of three databases described in the text: 3ie's title/abstract/keyword database of impact evaluations; J-PAL's database of affiliated randomized controlled trials; and AidGrade's database of impact evaluation results data.

Impact evaluations are still exponentially increasing in number and in terms of the resources devoted to them. The World Bank recently received a major grant from the UK aid agency DFID to expand its already large impact evaluation works; the Millennium Challenge Corporation has committed to conduct rigorous impact evaluations for 50% of its activities, with some form of credible evaluation of impact for every activity (Millennium Challenge Corporation, 2009); and the U.S. Agency for International Development is also increasingly invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of program funds to evaluation. [1] Yet while impact evaluations are still growing in development, a few thousand are already complete.
Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL, a center for development economics research, have completed each year; alongside are the number of development-related impact evaluations released that year according to 3ie, which keeps a directory of titles, abstracts, and other basic information on impact evaluations more broadly, including quasi-experimental designs; finally, the dashed line shows the number of papers that came out in each year that are included in AidGrade's database of impact evaluation results, which will be described shortly.

[1] While most of these are less rigorous performance evaluations, country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).

To summarize, while we do impact evaluation to figure out what will happen in the future, many issues have been raised about how well we can extrapolate from past impact evaluations, and despite the importance of the topic, previously we could do little more than guess or examine the question in narrow settings, as we did not have the data. Now we have the opportunity to address speculation, drawing on a large, unique dataset of impact evaluation results.

I founded a non-profit organization dedicated to gathering this data. That organization, AidGrade, seeks to systematically understand which programs work best where, a task that requires also knowing the limits of our knowledge. To date, AidGrade has conducted 20 meta-analyses and systematic reviews of different development programs. [2] Data gathered through meta-analyses are the ideal data to answer the question of how much we can extrapolate from past results, and since data on these 20 topics were collected in the same way, coding the same outcomes and other variables, we can look across different types of programs to see if there are any more general trends. Currently, the data set contains 637 papers on 210 narrowly-defined intervention-outcome combinations, with the greater database containing 14,993 estimates.

The data further allow me to examine a second set of questions revolving around specification searching and publication bias. Specification searching refers to the practice whereby researchers artificially select results that meet the criterion for being considered statistically significant, biasing results.
It has been found to be a systematic problem by Gerber and Malhotra in the political science and sociology literature (2008a; 2008b); Simmons and Simonsohn (2011) and Bastardi, Uhlmann and Ross (2011) in psychology; and Brodeur et al. (2012) in economics. I look for evidence of specification searching and publication bias in this large data set, both for its own importance and because it is possible that results may appear to be more generalizable merely because they suffer from a common bias.

I pay particular attention to randomized controlled trials (RCTs), which are considered the gold standard in the sciences and on which development economics has increasingly relied. It is possible that the method may reduce specification searching due to its emphasis on rigor or the increased odds of publication independent of results. It is also possible that RCTs may be done in more selected settings, leading to results not generalizing as well. I will shed light on both of these potential issues.

The outline of this paper is as follows. First, I define generalizability, present some basic statistics about it, and use leave-one-out cross-validation to check what kinds of study characteristics can help predict another study's results better than placebo data. I also conduct leave-one-out hierarchical Bayesian meta-analyses of all but one result within an intervention-outcome combination and check to what extent these meta-analysis results, which theoretically might provide the best estimates of a given program's effects, predict the result left out. Since some of the analyses will draw upon statistical methods not commonly used in economics, I will use the concrete example of conditional cash transfers (CCTs), which are relatively well-understood and on which many papers have been written, to elucidate the issues.

Regarding specification searching, I first conduct caliper tests on the distribution of z-statistics, seeing whether there is a disproportionate number of papers just above the threshold for statistical significance compared to those just below the threshold. The data contain both published papers and unpublished working papers, and I examine how much publication bias there appears to be for results that are significant. After examining how much these biases are present, I discuss how one might correct for them when considering generalizability. There is a simple mathematical adjustment that could be made if one were willing to accept the constraints of a fixed effects meta-analysis model. [3] However, I show this would not be an appropriate model for these data. Instead, I turn to using RCTs, which I show do not suffer as much from these biases, as a robustness check.

While this paper focuses on results for impact evaluations of development programs, this is only one of the first areas within economics to which these kinds of methods can be applied. In many of the sciences, knowledge is built through a combination of researchers conducting individual studies and other researchers synthesizing the evidence through meta-analysis. This paper begins that natural next step.

[2] Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes for meta-analysis and became systematic reviews.

[3] In particular, Simonsohn et al. (2014) make the assumption of fixed effects.
2 Theory

2.1 Heterogeneous Treatment Effects

I model treatment effects as potentially depending on the context of the intervention. Each impact evaluation is on a particular intervention and covers a number of outcomes. The relationship between an outcome, the inputs that were part of the intervention, and the context of the study is complex. In the simplest model, we can imagine that context can be represented by a contextual variable, $C$, such that:

$$Z_j = \alpha + \beta T_j + \delta C_j + \gamma T_j C_j + \varepsilon_j \tag{1}$$

where $j$ indexes the individual, $Z$ represents the value of an aggregate outcome such as enrollment rates, $T$ indicates being treated, and $C$ represents a contextual variable, such as the type of agency that implemented the program. [4]

In this framework, a particular impact evaluation might explicitly estimate:

$$Z_j = \alpha + \beta' T_j + \varepsilon_j \tag{2}$$

but, as Equation 1 can be rewritten as $Z_j = \alpha + (\beta + \gamma C_j) T_j + \delta C_j + \varepsilon_j$, what $\beta'$ is really capturing is the effect $\beta' = \beta + \gamma C$. When $C$ varies, unobserved, in different contexts, the variance of $\beta'$ increases.

This is the simplest case. One can imagine that the true state of the world has "interaction effects all the way down". Interaction terms are often considered a second-order problem. However, that intuition could stem from the fact that we usually look for interaction terms within an already fairly homogeneous dataset, e.g. data from a single country, at a single point in time, on a particularly selected sample.

Not all aspects of context need matter to an intervention's outcomes. The set of contextual variables can be divided into a critical set on which outcomes depend and a set on which they do not; I will ignore the latter. Further, the relationship between $Z$ and $C$ can vary by intervention or outcome. For example, school meals programs might have more of an effect on younger children, but scholarship programs could plausibly affect older children more. If one were to regress effect size on the contextual variable "age", we would get different results depending on which intervention and outcome we were considering. Therefore, it will be important in this paper to look only at a restricted set of contextual variables which could plausibly work in a similar way across different interventions. Additional analysis could profitably be done within some interventions, but this is outside the scope of this paper.

Generalizability will ultimately depend on the heterogeneity of treatment effects. The next section formally defines generalizability for use in this paper.

[4] $Z$ can equally well be thought of as the average individual outcome for an intervention. Throughout, I take high values for an outcome to represent a beneficial change unless otherwise noted; if an outcome represents a negative characteristic, like incidence of a disease, its sign will be flipped before analysis.

2.2 Generalizability: Definitions and Measurement

Definition 1. Generalizability is the ability to predict results accurately out of sample.

Definition 2. Local generalizability is the ability to predict results accurately in a particular out-of-sample group.

Any empirical work, including this paper, will only be able to address local generalizability. However, I will argue we should not be concerned about this. First, the meta-analysis data were explicitly gathered using very broad inclusion criteria, aiming to capture the universe of studies. Second, by using a large set of studies in various contexts and repeatedly leaving out some of the studies when generating predictions, we can estimate the sensitivity of results to the inclusion of particular papers. In particular, as part of this paper I will systematically leave out one of the studies, predict it based on the other studies, and do this for each study, cycling through the studies. As a robustness check I do this for different subsets of the data.

There are several ways to measure predictive power. Most measures of predictive power rely on building a model on a "training" data set and estimating the fit of that model on an out-of-sample "test" data set, or estimating on the same data set but with a correction. Gelman et al. (2013) provide a good summary. I will focus on one of these methods: cross-validation. In particular, I will use the predicted residual sum of squares (PRESS) statistic, which is closely related to the mean squared error and can also be used to generate an $R^2$-like statistic. Specifically, to calculate it one follows this procedure:
1. Start at study $i = 1$ within each intervention-outcome combination.
2. Generate the predicted value of effect size $Y_i$, $\hat{Y}_i$, by building a model based on $Y_{-i}$, the effect sizes for all observations in that intervention-outcome except $i$. For example, regress $Y_{-i} = \alpha + \beta C_{-i} + \varepsilon$, where $C$ represents a predictor of interest, and then use the estimated coefficients to predict $\hat{Y}_i$. Alternatively, generate $\hat{Y}_i$ as the meta-analysis result from synthesizing $Y_{-i}$.
3. Calculate the squared error, $(Y_i - \hat{Y}_i)^2$.
4. Repeat for each $i$.
5. Calculate $\mathrm{PRESS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$.

To aid interpretation, I also calculate the PRESS statistic for placebo data in simulations.

The models for predicting $Y_i$ are intentionally simple, as both economists and policymakers often do not build complicated models of the effect sizes when drawing inferences from past studies. First, I simply see if interventions, outcomes, intervention-outcomes, region or implementer have any explanatory power in predicting the resultant effect size. I also try using the meta-analysis result $M_{-i}$, obtained from synthesizing the effect sizes $Y_{-i}$, as the predictor of $Y_i$. This makes intuitive sense; we are often in the situation of wanting to predict the effect size in a new context from a meta-analysis result.

When calculating the PRESS statistic, I correct the results for attenuation bias using a first-order approximation described in Gelman et al. (2013) and Tibshirani and Tibshirani (2009). Representing the cross-validation squared error $(Y_i - \hat{Y}_i)^2$ as $CV$, $\mathrm{bias} = \widehat{CV} - \overline{CV}$.

While predictive power is perhaps the most natural measure of generalizability, I will also show results on how impact evaluation results correlate with each other. The difference between correlation and predictive power is clear: it is similar to the difference between an estimated coefficient and an $R^2$.
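The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the function names (`ols`, `loo_predictions`, `press`), the toy effect sizes, and the use of a plain mean of the other studies as a stand-in for the meta-analysis result are all hypothetical, and the bias correction is left out.

```python
from statistics import mean


def ols(x, y):
    """Fit y = a + b*x by least squares and return (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:  # predictor does not vary: fall back to the mean of y
        return my, 0.0
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def loo_predictions(effects, predictor=None):
    """Predict each effect size from all the others in its group (steps 1-2, 4)."""
    preds = []
    for i in range(len(effects)):
        rest = effects[:i] + effects[i + 1:]
        if predictor is None:
            # Assumption: unweighted mean stands in for the meta-analysis result.
            preds.append(mean(rest))
        else:
            a, b = ols(predictor[:i] + predictor[i + 1:], rest)
            preds.append(a + b * predictor[i])
    return preds


def press(effects, predictor=None):
    """Sum of squared leave-one-out prediction errors (steps 3 and 5)."""
    preds = loo_predictions(effects, predictor)
    return sum((y - p) ** 2 for y, p in zip(effects, preds))


effects = [0.1, 0.2, 0.3]                # toy standardized effect sizes
context = [1.0, 2.0, 3.0]                # hypothetical contextual predictor C
print(press(effects))                    # predicting from the mean of the others
print(press(effects, context))           # predicting from a regression on C
```

In this toy group the effect sizes are exactly linear in the hypothetical predictor, so the regression-based PRESS is essentially zero while the mean-based PRESS is not; a lower PRESS indicates better out-of-sample prediction.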
Impact evaluation results could be correlated so that regressing $Y_i$ on explanatory variables like $M_{-i}$ could result in a large, significant coefficient on the $M_{-i}$ term while still having a low $R^2$. Indeed, this is what we will see in the data. To create the meta-analysis result $M_{-i}$, I use a hierarchical Bayesian random effects model with an uninformative prior, as described in the next section on meta-analysis.
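A small simulation can make the distinction between a significant coefficient and a low $R^2$ concrete. The sketch below uses made-up data, not the paper's: the sample size, the noise levels, and the helper `ols_slope_stats` are all assumptions chosen so that the predictor carries real signal but explains little of the variance.

```python
import math
import random


def ols_slope_stats(x, y):
    """Return (slope, t-statistic of the slope, R^2) for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / se_slope, 1.0 - sse / sst


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical predictor (e.g. a meta-analysis result) with little spread ...
m = [rng.gauss(0.0, 0.1) for _ in range(2000)]
# ... and outcomes that move one-for-one with it but are very noisy.
y = [mi + rng.gauss(0.0, 0.5) for mi in m]

slope, t_stat, r2 = ols_slope_stats(m, y)
print(f"slope={slope:.2f}  t={t_stat:.1f}  R^2={r2:.3f}")
```

With these assumed noise levels the slope is estimated precisely enough to be statistically significant, yet most of the variation in the outcome remains unexplained, which is the pattern the text describes.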
2.3 Models Used in Meta-Analyses

This paper uses meta-analysis as a tool to synthesize evidence. As a quick review, there are many steps in a meta-analysis, most of which have to do with the selection of the constituent papers. The search and screening of papers will be described in the data section; here, I merely discuss the theory behind how meta-analyses combine results.

One of two main statistical models underlies almost all meta-analyses: the fixed-effect model or the random-effects model. Fixed-effect models assume there is one true effect of a particular program and all differences between studies can be attributed simply to sampling error. In other words:

$$Y_i = \theta + \varepsilon_i \tag{3}$$

where $\theta$ is the true effect and $\varepsilon_i$ is the error term.

Random-effects models do not make this assumption; the true effect could potentially vary from context to context. Here,

$$Y_i = \theta_i + \varepsilon_i \tag{4}$$
$$\phantom{Y_i} = \theta + \eta_i + \varepsilon_i \tag{5}$$

where $\theta_i$ is the effect size for a particular study $i$, $\theta$ is the mean true effect size, $\eta_i$ is a particular study's divergence from that mean true effect size, and $\varepsilon_i$ is the error.

When estimating either a fixed effect or random effects model through meta-analysis, a choice must be made: how to weight the studies that serve as inputs to the meta-analysis. Several weighting schemes can be used, but by far the most common are inverse-variance weights. As the variance is a measure of how certain we are of the effect, this ensures that those results about which we are more confident get weighted more heavily. The variance will contain a between-studies term in the case of random effects. Writing the weights as $W$, the summary effect is simply:

$$M = \frac{\sum_{i=1}^{k} W_i Y_i}{\sum_{i=1}^{k} W_i} \tag{6}$$
with standard error $\sqrt{1 / \sum_{i=1}^{k} W_i}$.

To build a hierarchical Bayesian model, I first assume the data are normally distributed:

$$Y_{ij} \mid \theta_i \sim N(\theta_i, \sigma^2) \tag{7}$$

where $j$ indexes the individuals in the study. I do not have individual-level data, but instead can use sufficient statistics:

$$Y_i \mid \theta_i \sim N(\theta_i, \sigma_i^2) \tag{8}$$

where $Y_i$ is the sample mean and $\sigma_i^2$ the sample variance. This provides the likelihood for $\theta_i$. I also need a prior for $\theta_i$. I assume between-study normality:

$$\theta_i \sim N(\mu, \tau^2) \tag{9}$$

where $\mu$ and $\tau$ are unknown hyperparameters.

Conditioning on the distribution of the data, given by Equation 8, I get a posterior:

$$\theta_i \mid \mu, \tau, Y \sim N(\hat{\theta}_i, V_i) \tag{10}$$

where

$$\hat{\theta}_i = \frac{\dfrac{Y_i}{\sigma_i^2} + \dfrac{\mu}{\tau^2}}{\dfrac{1}{\sigma_i^2} + \dfrac{1}{\tau^2}}, \qquad V_i = \frac{1}{\dfrac{1}{\sigma_i^2} + \dfrac{1}{\tau^2}} \tag{11}$$

I then need to pin down $\mu \mid \tau$ and $\tau$ by constructing their posterior distributions given non-informative priors and updating based on the data. I assume a uniform prior for $\mu \mid \tau$, and as the $Y_i$ are estimates of $\mu$ with variance $(\sigma_i^2 + \tau^2)$, obtain:

$$\mu \mid \tau, Y \sim N(\hat{\mu}, V_\mu) \tag{12}$$

where

$$\hat{\mu} = \frac{\sum_i \dfrac{Y_i}{\sigma_i^2 + \tau^2}}{\sum_i \dfrac{1}{\sigma_i^2 + \tau^2}}, \qquad V_\mu = \left( \sum_i \frac{1}{\sigma_i^2 + \tau^2} \right)^{-1} \tag{13}$$
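The weighted summary in Equation 6 and the conditional posteriors in Equations 11 and 13 reduce to a few lines of arithmetic once $\tau$ is taken as given. The following is a hedged sketch rather than the paper's implementation: the function names and the toy inputs are assumptions, and the step of simulating $\tau$ itself is not shown.

```python
import math


def summary_effect(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean and its standard error (Equation 6).

    With tau2 = 0 the weights are fixed-effect weights 1 / sigma_i^2.
    With tau2 > 0 they are 1 / (sigma_i^2 + tau^2), so the result matches
    mu-hat and sqrt(V_mu) in Equation 13.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    m = sum(w * y for w, y in zip(weights, effects)) / total
    return m, math.sqrt(1.0 / total)


def shrunken_effect(y_i, var_i, mu, tau2):
    """Posterior mean and variance of theta_i given mu and tau (Equation 11)."""
    precision = 1.0 / var_i + 1.0 / tau2
    theta_hat = (y_i / var_i + mu / tau2) / precision
    return theta_hat, 1.0 / precision


# Toy inputs (hypothetical): two studies with equal sampling variance.
effects = [0.1, 0.3]
variances = [0.01, 0.01]

print(summary_effect(effects, variances))             # fixed-effect summary
print(summary_effect(effects, variances, tau2=0.01))  # random-effects weights
print(shrunken_effect(0.3, 0.01, mu=0.2, tau2=0.01))  # study estimate pulled toward mu
```

When a study's sampling variance equals the between-study variance, as in the last call, the posterior mean sits halfway between the study's own estimate and $\mu$; noisier studies are pulled further toward $\mu$.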
For $\tau$, note that $p(\tau \mid Y) = \dfrac{p(\mu, \tau \mid Y)}{p(\mu \mid \tau, Y)}$. The denominator follows from Equation 12; for the numerator, we can observe that $p(\mu, \tau \mid Y)$ is proportional to $p(\mu, \tau)\, p(Y \mid \mu, \tau)$, and we know the marginal distribution of $Y_i \mid \mu, \tau$:

$$Y_i \mid \mu, \tau \sim N(\mu, \sigma_i^2 + \tau^2) \tag{14}$$

I use a uniform prior for $\tau$, following Gelman et al. (2005). This yields the posterior for the numerator:

$$p(\mu, \tau \mid Y) \propto p(\mu, \tau) \prod_i N(Y_i \mid \mu, \sigma_i^2 + \tau^2) \tag{15}$$

Putting together all the pieces in reverse order, I first simulate $\tau$, then generate $p(\tau \mid Y)$ using $\tau$, followed by $\mu$ and finally $\theta_i$.

Unless otherwise noted, I rely on this hierarchical Bayesian random effects model to generate meta-analysis results.

3 Data

This paper uses a database of impact evaluation results collected by AidGrade, a U.S. non-profit research institute that I founded in . AidGrade focuses on gathering the results of impact evaluations and analyzing the data, including through meta-analysis. Its data on impact evaluation results were collected in the course of its meta-analyses from (AidGrade, 2015).

AidGrade's meta-analyses follow the standard stages: (1) topic selection; (2) a search for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more detail is provided in Appendix B.

3.1 Selection of Papers

The interventions that were selected for meta-analysis were selected largely on the basis of there being a sufficient number of studies on that topic. Five AidGrade staff members each independently made a preliminary list of interventions for examination; the lists were then combined and searches done for each topic to determine if there were likely to be enough impact evaluations for a meta-analysis. The remaining list was voted on by the general public online and partially randomized. Appendix B provides further detail.

A comprehensive literature search was done using a mix of the search aggregators SciVerse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA and 3ie were also searched for completeness. Finally, the references of any existing systematic reviews or meta-analyses were collected.

Any impact evaluation which appeared to be on the intervention in question was included, barring those in developed countries. [5] Any paper that tried to consider the counterfactual was considered an impact evaluation. Both published papers and working papers were included. The search and screening criteria were deliberately broad. There is not enough room to include the full text of the search terms and inclusion criteria for all 20 topics in this paper, but these are available in an online appendix as detailed in Appendix A.

3.2 Data Extraction

The subset of the data on which I am focusing is based on those papers that passed all screening stages in the meta-analyses. Again, the search and screening criteria were very broad and, after passing the full text screening, the vast majority of papers that were later excluded were excluded merely because they had no outcome variables in common or did not provide adequate data (for example, not providing data that could be used to calculate the standard error of an estimate, or for a variety of other quirky reasons, such as displaying results only graphically). The small overlap of outcome variables is a surprising and notable feature of the data. Ultimately, the data I draw upon for this paper consist of 14,993 results (double-coded and then reconciled by a third researcher) across 637 papers covering the 20 types of development program listed in Table 1.
For the sake of comparison, though the two organizations clearly do different things, at the time of writing this is more impact evaluations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318 of these papers both overlapped in outcomes with another paper and were able to be standardized and thus included in the main results which rely on intervention-outcome groups. Outcomes were defined under several rules of varying specificity, as will be discussed shortly.

Table 1: List of Development Programs Covered [6]

  Conditional cash transfers
  Contract teachers
  Deworming
  Financial literacy training
  Improved stoves
  HIV education
  Insecticide-treated bed nets
  Irrigation
  Microfinance
  Micro health insurance
  Safe water storage
  Micronutrient supplementation
  Scholarships
  Mobile phone-based reminders
  School meals
  Performance pay
  Unconditional cash transfers
  Rural electrification
  Water treatment
  Women's empowerment programs

[5] High-income countries, according to the World Bank's classification system.

[6] Three titles here may be misleading. "Mobile phone-based reminders" refers specifically to SMS or voice reminders for health-related outcomes. "Women's empowerment programs" required an educational component to be included in the intervention and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed down to focus on those providing zinc to children, but the other micronutrient papers are still included in the data, with a tag, as they may still be useful.

73 variables were coded for each paper. Additional topic-specific variables were coded for some sets of papers, such as the median and mean loan size for microfinance programs. This paper focuses on the variables held in common across the different topics. These include which method was used; if randomized, whether it was randomized by cluster; whether it was blinded; where it was (village, province, country; these were later geocoded in a separate process); what kind of institution carried out the implementation; characteristics of the population; and the duration of the intervention from the baseline to the midline or endline results, among others. A full set of variables and the coding manual is available online, as detailed in Appendix A.

As this paper pays particular attention to the program implementer, it is worth discussing how this variable was coded in more detail. There were several types of implementers that could be coded: governments, NGOs, private sector firms, and academics. There was also a code for "other" (primarily collaborations) or "unclear". The vast majority of studies were implemented by academic research teams and NGOs.
This paper considers NGOs and academic research teams together because it turned out to be practically difficult to distinguish between them in the studies, especially as the passive voice was frequently used (e.g. "X was done" without noting who did it). There were only a few private sector firms involved, so they are considered with the "other" category in this paper.
Studies tend to report results for multiple specifications. AidGrade focused on those results least likely to have been influenced by author choices: those with the fewest controls, apart from fixed effects. Where a study reported results using different methodologies, coders were instructed to collect the findings obtained under the authors' preferred methodology; where the preferred methodology was unclear, coders were advised to follow the internal preference ordering of prioritizing randomized controlled trials, followed by regression discontinuity designs and differences-in-differences, followed by matching, and to collect multiple sets of results when they were unclear on which to include. Where results were presented separately for multiple subgroups, coders were similarly advised to err on the side of caution and to collect both the aggregate results and results by subgroup except where the author appeared to be only including a subgroup because results were significant within that subgroup. For example, if an author reported results for children aged 8-15 and then also presented results for children aged 12-13, only the aggregate results would be recorded, but if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups would be coded as well as the aggregate result when presented. Authors only rarely reported isolated subgroups, so this was not a major issue in practice.

When considering the variation of effect sizes within a group of papers, the definition of the group is clearly critical. Two different rules were initially used to define outcomes: a strict rule, under which only identical outcome variables are considered alike, and a loose rule, under which similar but distinct outcomes are grouped into clusters. The precise coding rules were as follows:

1. We consider outcome A to be the same as outcome B under the strict rule if outcomes A and B measure the exact same quality. Different units may be used, pending conversion. The outcomes may cover different timespans (e.g. encompassing both outcomes over "the last month" and "the last week"). They may also cover different populations (e.g. children or adults). Examples: height; attendance rates.
2. We consider outcome A to be the same as outcome B under the loose rule if they do not meet the strict rule but are clearly related. Example: "parasitemia greater than 4000/µl with fever" and "parasitemia greater than 2500/µl".
Clearly, even under the strict rule, differences between the studies may exist; however, using two different rules allows us to isolate the potential sources of variation, and other variables were coded to capture some of this variation, such as the age of those in the sample. If one were to divide the studies by these characteristics, however, the data would usually be too sparse for analysis.

Interventions were also defined separately and coders were also asked to write a short description of the details of each program. Program names were recorded so as to identify those papers on the same program, such as the various evaluations of PROGRESA.

After coding, the data were then standardized to make results easier to interpret and so as not to overly weight those outcomes with larger scales. The typical way to compare results across different outcomes is by using the standardized mean difference, defined as:

$$SMD = \frac{\mu_1 - \mu_2}{\sigma_p}$$

where $\mu_1$ is the mean outcome in the treatment group, $\mu_2$ is the mean outcome in the control group, and $\sigma_p$ is the pooled standard deviation. When data are not available to calculate the pooled standard deviation, it can be approximated by the standard deviation of the dependent variable for the entire distribution of observations or as the standard deviation in the control group (Glass, 1976). If that is not available either, due to standard deviations not having been reported in the original papers, one can use the typical standard deviation for the intervention-outcome. I follow this approach to calculate the standardized mean difference, which is then used as the effect size measure for the rest of the paper unless otherwise noted.

This paper uses the strict outcomes where available, but the loose outcomes where that would keep more data. For papers which were follow-ups of the same study, the most recent results were used for each outcome.
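The standardization step can be illustrated with a short function. This is a sketch under assumptions: the text does not spell out the pooled standard deviation, so the usual degrees-of-freedom-weighted formula is assumed here, and the function name, argument names, and numbers are hypothetical. The fallbacks loosely mirror the order described above, with the control-group standard deviation tried before a caller-supplied substitute.

```python
import math


def standardized_mean_difference(mean_t, mean_c, sd_t=None, sd_c=None,
                                 n_t=None, n_c=None, sd_fallback=None):
    """SMD = (treatment mean - control mean) / standard deviation.

    Uses the pooled SD when both group SDs and sample sizes are known,
    otherwise the control-group SD (Glass, 1976), otherwise a supplied
    fallback such as the full-sample SD or a typical SD for the
    intervention-outcome.
    """
    if None not in (sd_t, sd_c, n_t, n_c):
        # Assumed formula: degrees-of-freedom-weighted pooled variance.
        pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
        sd = math.sqrt(pooled_var)
    elif sd_c is not None:
        sd = sd_c
    elif sd_fallback is not None:
        sd = sd_fallback
    else:
        raise ValueError("no standard deviation available to standardize with")
    return (mean_t - mean_c) / sd


# Hypothetical test scores: treatment mean 12, control mean 10.
print(standardized_mean_difference(12, 10, sd_t=4, sd_c=4, n_t=50, n_c=50))
print(standardized_mean_difference(12, 10, sd_c=5))         # control-group SD only
print(standardized_mean_difference(12, 10, sd_fallback=8))  # typical SD for the outcome
```

Because the result is expressed in standard deviation units, outcomes measured on very different scales, such as test scores and enrollment rates, can be placed side by side.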
Finally, one paper appeared to misreport results, suggesting implausibly low values and standard deviations for hemoglobin. These results were excluded and the paper's corresponding author contacted. Excluding this paper's results, effect sizes range between -1.5 and 1.8 SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual results, especially with the small number of papers in some intervention-outcome groups, I restrict attention to those standardized effect sizes less than 2 SD away from 0, dropping 1 additional observation. I report main results including this observation in the Appendix.

3.3 Data Description

Figure 2 summarizes the distribution of studies covering the interventions and outcomes considered in this paper. Attention will typically be limited to those intervention-outcome combinations on which we have data for at least three papers, with an alternative minimum of four papers in the Appendix.

Table 12 in Appendix C lists the interventions and outcomes and describes their results in a bit more detail, providing the distribution of significant and insignificant results. It should be emphasized that the number of negative and significant, insignificant, and positive and significant results per intervention-outcome combination provides only ambiguous evidence of the typical efficacy of a particular type of intervention. Simply tallying the numbers in each category is known as vote counting and can yield misleading results if, for example, some studies are underpowered.

Table 2 further summarizes the distribution of papers across interventions and highlights the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent with the story of researchers each wanting to publish one of the first papers on a topic. We will indeed see that later papers on the same intervention-outcome combination more often remain as working papers.

A note must be made about combining data. When conducting a meta-analysis, the Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the data to one observation per intervention-outcome-paper, and I do this for generating the within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had been reported for multiple subgroups (e.g. women and men), I aggregated them as in the Cochrane Handbook's Table 7.7.a. Where results were reported for multiple time periods (e.g. 6 months after the intervention and 12 months after the intervention), I used the most comparable time periods across papers.
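As an illustration of the subgroup aggregation step, the sketch below combines the sample size, mean, and standard deviation of two subgroups into the values for the merged group. It uses the exact identity for pooling two samples; whether this matches every detail of the handbook table cited above is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def combine_two_groups(n1, mean1, sd1, n2, mean2, sd2):
    """Return (n, mean, sd) of the sample formed by merging two subgroups.

    sd1 and sd2 are sample standard deviations (n - 1 denominators); the
    returned sd is the sample standard deviation of the merged sample.
    """
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    variance = ((n1 - 1) * sd1 ** 2
                + (n2 - 1) * sd2 ** 2
                + (n1 * n2 / n) * (mean1 - mean2) ** 2) / (n - 1)
    return n, mean, sqrt(variance)


# Hypothetical subgroups: the values 1, 2, 3 and the values 4, 5, 6.
n, mean, sd = combine_two_groups(3, 2.0, 1.0, 3, 5.0, 1.0)
```

More than two subgroups can be handled by applying the function repeatedly, merging one subgroup at a time.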
When combining across multiple outcomes, which has limited use but will come up later in the paper, I used the formulae from Borenstein et al. (2009), Chapter
Figure 2: Within-Intervention-Outcome Number of Papers
Table 2: Descriptive Statistics: Distribution of Narrow Outcomes
Columns: Intervention | Number of outcomes | Mean papers per outcome | Max papers per outcome
Rows: Conditional cash transfers; Contract teachers; Deworming; Financial literacy; HIV/AIDS Education; Improved stoves; Insecticide-treated bed nets; Irrigation; Micro health insurance; Microfinance; Micronutrient supplementation; Mobile phone-based reminders; Performance pay; Rural electrification; Safe water storage; Scholarships; School meals; Unconditional cash transfers; Water treatment; Women's empowerment programs; Average

Table 3: Differences between vote counting and meta-analysis results
Columns (meta-analysis result): Negative and significant | Insignificant | Positive and significant | Total
Rows (vote counting result): Negative; Insignificant; Positive; Total
4 Generalizability of Impact Evaluation Results

4.1 Method

The first thing I do is report basic summary statistics. I ask: given a positive, significant result (the kind perhaps most relevant for motivating policy), what proportion of papers on the same intervention-outcome combination find a positive, significant effect, an insignificant effect, or a negative, significant effect? How much do the results of vote counting and meta-analysis diverge?

Another key summary statistic is the coefficient of variation. This statistic is frequently used as a measure of dispersion of results and is defined as $\sigma / \mu$, where $\mu$ is the mean of a set of results and $\sigma$ the standard deviation (see footnote 7). The set of results under consideration here is defined by the intervention-outcome combination; I also separately look at variation within papers. While the coefficient of variation is a basic statistic, this is the first time it is reported across a wide variety of impact evaluation results. Finally, I look at how much results overlap within intervention-outcome combinations, using the raw, unstandardized data.

I then regress the effect size on several explanatory variables, including the leave-one-out meta-analysis result for all but study $i$ within each intervention-outcome combination, $M_{-i}$. Whenever I use $M_{-i}$ to predict $Y_i$, I adjust the estimates for sampling variance to avoid attenuation bias. I also cluster standard errors at the intervention-outcome level to guard against the case in which an outlier introduces systematic error within an intervention-outcome. While the main results are based on any intervention-outcome combination covered by at least three papers, as mentioned I try increasing this minimum number of papers in robustness checks that are included in the Appendix.

Finally, I construct the PRESS statistic to measure generalizability, as discussed in Section 2.2.
In order to be able to say whether the result is large or small, I also calculate an $R^2$-like statistic for prediction and conduct simulations using placebo data for comparison. The $R^2$-like statistic is the $R^2_{Pr}$ from prediction:

$$R^2_{Pr} = 1 - \frac{PRESS}{SS_{Tot}} \qquad (16)$$

where $SS_{Tot}$ is the total sum of squares. In the simulations, I randomly assign the real effect size data to alternative interventions, outcomes, or other variables and then generate the PRESS statistic and the predicted $R^2_{Pr}$ for these placebo groups.

7 Absolute values are always taken.

4.2 Results

Summary statistics

Summary statistics provide a first look at how results vary across different papers. On average, an intervention-outcome combination comprises 37% positive, significant results; 58% insignificant results; and 5% negative, significant results. To gauge how stable results are, suppose we know that one study in a particular intervention-outcome combination found a positive, significant result (the kind of result one might think could influence policy); drawing another study at random from the set, there is a 60% chance the new result will be insignificant and an 8% chance it will be significant and negative, leaving only about a 32% chance it will again be positive and significant.

The differences between meta-analysis results and vote counting results are shown in Table 3. Only those intervention-outcomes which had a vote counting "winner" are included in this table; in other words, it does not include ties. That the meta-analysis result was often positive and significant or negative and significant when the vote counting result was insignificant is likely a function of many impact evaluations being underpowered.

In the methods section, I discussed the coefficient of variation, a measure of the dispersion of the impact evaluation findings. Values for the coefficient of variation in the medical literature tend to range from approximately 0.1 to 0.5. Figure 3 shows its distribution in the economics data, across papers within intervention-outcomes as well as within papers. Each of the across-paper coefficients of variation was calculated within an intervention-outcome combination; for example, the effects of conditional cash transfer programs on enrollment rates.
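A minimal sketch of the coefficient of variation as used here is given below. The use of the sample standard deviation and the handling of a zero mean are assumptions of this sketch; the optional cap mirrors the truncation at 10 described for Figure 3.

```python
from statistics import mean, stdev


def coefficient_of_variation(effect_sizes, cap=None):
    """Absolute coefficient of variation |sigma / mu| of a set of results.

    effect_sizes: the results for one intervention-outcome combination
    (or the results within one paper). cap: optional truncation value.
    """
    if len(effect_sizes) < 2:
        raise ValueError("at least two results are needed")
    mu = mean(effect_sizes)
    if mu == 0:
        # Assumption: an undefined ratio is treated as unbounded dispersion.
        cv = float("inf")
    else:
        cv = abs(stdev(effect_sizes) / mu)
    return min(cv, cap) if cap is not None else cv


# Hypothetical effect sizes (in SD) for one intervention-outcome combination.
cv_example = coefficient_of_variation([0.1, 0.2, 0.3])
```

Because the mean sits in the denominator, sets of results centered near zero produce very large values, which is one reason a truncation such as the one in Figure 3 is useful when plotting.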
When a paper reports multiple results, the previously described conventions work to either select one of them or aggregate them so that there is one result per intervention-outcome-paper that is used to calculate the within-intervention-outcome, across-paper coefficient of variation. The across-paper coefficient of variation is thus likely lower than it might otherwise be due to the aggregation process reducing noise.

Figure 3: Distribution of the Coefficient of Variation

The within-paper coefficients of variation are calculated where the data include multiple results from the same paper on the same intervention-outcome. There are several reasons a paper may have reported multiple results: multiple time periods were examined; the author used multiple methods; or results for different subgroups were collected. In each of these scenarios, the context is more similar than it typically is across different papers. Variation within a single paper due to different specifications and subgroups has often been neglected in the literature but constituted on average approximately 67% of the variation across papers within a single intervention-outcome combination.

As Figure 3 makes clear, the coefficient of variation within the same paper within an intervention-outcome combination is much lower than that across papers. The mean coefficient of variation across papers in the same intervention-outcome combination is 1.9; the mean coefficient of variation for results within the same paper in the same intervention-outcome combination is lower, at 1.2, a difference that is statistically significant in a t-test at p < [...].

The coefficient of variation clearly depends on the set of results being considered. Outcomes were extremely narrowly defined, as discussed; interventions varied more. For example, a school meals program might disburse different kinds of meals in one study than in another. The contexts also varied, such as in terms of the implementing agency, the age group, the underlying rates of malnutrition, and so on. The data are mostly too sparse to use this information; however, a few papers considered the same programs, and the average coefficient of variation within the same intervention-outcome-program combination, across papers, was 1.5.

8 All these results are based on truncating the coefficient of variation at 10 as in Figure 3; if one does so at 20, the across-paper coefficient of variation rises to 2.1 and the within-paper to [...]

To aid in interpreting the results, I return to the unstandardized values within intervention-outcome combinations. I ask: what is the typical gap between a study's point estimate and the average point estimate within that intervention-outcome combination? How often do the confidence intervals around an estimated effect size overlap within intervention-outcomes? Table 4 presents some results, excluding risk ratios and rate ratios, which are on different scales. The mean absolute difference between a study's point estimate and the average point estimate within that intervention-outcome combination is about 90%. Regarding the confidence intervals, a given result in an intervention-outcome combination will, on average, have a confidence interval that overlaps with about 85% of the confidence intervals of the other results in that intervention-outcome. The point estimate will be contained in the confidence interval of the other studies approximately half of the time.

Regression results

Do results exhibit any systematic variation? This section examines whether generalizability is associated with study characteristics such as the type of program implementer. I first present some OLS results. As Table 5 indicates, there is some evidence that studies with a smaller number of observations have greater effect sizes than studies based on a larger number of observations. This is what we would expect if specification searching were easier for small datasets. The same pattern would also arise if power calculations led researchers to proceed with small-sample studies only when they believed the program would have a large effect, or if larger studies are less well-targeted.
Interestingly, government-implemented programs fare worse even controlling for sample size (the dummy variable category left out is "Other-implemented", which mainly consists of collaborations and private sector-implemented interventions). Studies in the Middle East / North Africa region may appear to do slightly better than those in Sub-Saharan Africa (the excluded region category), but not much weight should be put on this as very few studies were conducted in the former region.

I then turn to the PRESS statistic in Table 6. In this table, all C represent dummy variables on the RHS of the regression that is fit; for example, the first row uses the fitted $\hat{Y}_i$ from regressing $Y_i = \alpha + \sum_n \beta_n \text{Intervention}_{in} + \varepsilon_i$, where Intervention comprises dummy variables indicating different interventions. The PRESS statistic and $R^2_{Pr}$ from each regression of Y on assorted C are listed, along with the average PRESS statistic and $R^2_{Pr}$ from the corresponding placebo simulations. It should be noted that, unlike $R^2$, $R^2_{Pr}$ need not have a lower bound of zero. This is because the predicted residual sum of squares, which is by definition greater than the residual sum of squares, can also be greater than the total sum of squares. The p-value gives how likely it is that the PRESS statistic comes from the distribution of simulation PRESS statistics, using the standard deviation from the simulations (see footnote 9).

As Table 6 shows, one can distinguish the interventions, outcomes, intervention-outcomes, and regions in the data better than chance. The implementer dummy does not have significant predictive power here.

One might believe the relatively poor predictive power is due to too many diverse interventions and outcomes being grouped together. I therefore separate out the two interventions with the largest number of studies, CCTs and deworming, to see if patterns are any different within each of these interventions. Results are weaker here (Table 7). While this could partially be due to reduced sample sizes, it also suggests that there are many different sources of heterogeneity in the data.

Table 8 presents more PRESS statistics, this time using the leave-one-out meta-analysis result from within intervention-outcome combinations to predict the result left out. The significance of the meta-analysis results is striking compared to that of the earlier results, but the low $R^2_{Pr}$ should be noted.

The regressions in Table 9 show some of the key takeaways of this paper in an easily digested format. The relationship between the PRESS statistics and a regression model can also be seen by comparing Tables 8 and 9. In both cases, the meta-analysis result is a significant predictor of the effect size, but the $R^2$ or $R^2_{Pr}$ is also low in both tables. Table 9 also suggests that government-implemented programs do not fare as well on average.

9 Simulations were run 100 times for each regression.
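The PRESS statistic, the predictive $R^2_{Pr}$, and the placebo comparison can be sketched as follows for a regression on a single set of group dummies. The sketch relies on the fact that, for such a regression, the leave-one-out prediction for a study is the mean of the other studies in its group. The function names, the one-sided normal approximation for the p-value, and the default of 100 shuffles are assumptions of this illustration, not a description of the exact code behind Tables 6 to 8.

```python
import random
from collections import defaultdict
from math import erf, sqrt


def press_and_r2(y, groups):
    """PRESS and R2_Pr = 1 - PRESS / SS_Tot for a regression on group dummies."""
    sums, counts = defaultdict(float), defaultdict(int)
    for yi, g in zip(y, groups):
        sums[g] += yi
        counts[g] += 1
    press = 0.0
    for yi, g in zip(y, groups):
        if counts[g] < 2:
            raise ValueError("each group needs at least two results")
        # Leave-one-out prediction: mean of the other results in the group.
        prediction = (sums[g] - yi) / (counts[g] - 1)
        press += (yi - prediction) ** 2
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return press, 1 - press / ss_tot


def placebo_comparison(y, groups, n_sims=100, seed=0):
    """Compare the real PRESS with PRESS values under shuffled group labels."""
    rng = random.Random(seed)
    press, r2_pr = press_and_r2(y, groups)
    labels = list(groups)
    sims = []
    for _ in range(n_sims):
        rng.shuffle(labels)  # placebo assignment keeps group sizes fixed
        sims.append(press_and_r2(y, labels)[0])
    sim_mean = sum(sims) / n_sims
    sim_sd = sqrt(sum((s - sim_mean) ** 2 for s in sims) / (n_sims - 1))
    z = (press - sim_mean) / sim_sd
    # Assumption: one-sided p-value that a placebo PRESS is this low or lower.
    p_value = 0.5 * (1 + erf(z / sqrt(2)))
    return press, r2_pr, sim_mean, p_value


# Hypothetical effect sizes in two well-separated groups.
y_demo = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
g_demo = ["a", "a", "a", "b", "b", "b"]
press_demo, r2_demo = press_and_r2(y_demo, g_demo)
```

With groups that carry no information, the leave-one-out errors can exceed the total sum of squares, so $R^2_{Pr}$ turns negative; for instance, `press_and_r2([0.0, 1.0, 0.0, 1.0], ["a", "a", "b", "b"])` gives a PRESS of 4 against a total sum of squares of 1.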
Table 4: Differences in Point Estimates
Intervention | Outcome | Mean estimate | Mean difference | Units
Conditional Cash Transfers | Attendance rate | | | percentage points
Conditional Cash Transfers | Birth in a medical facility | | | percentage points
Conditional Cash Transfers | Enrollment rate | | | percentage points
Conditional Cash Transfers | Height | | | cm
Conditional Cash Transfers | Height-for-age | | | z-score
Conditional Cash Transfers | Labor force participation | | | percentage points
Conditional Cash Transfers | Labor hours | | | hours/week
Conditional Cash Transfers | Pregnancy rate | | | percentage points
Conditional Cash Transfers | Probability skilled attendant at delivery | | | percentage points
Conditional Cash Transfers | Probability unpaid work | | | percentage points
Conditional Cash Transfers | Retention rate | | | percentage points
Conditional Cash Transfers | Test scores | | | standard deviations
Conditional Cash Transfers | Unpaid labor hours | | | hours/week
Conditional Cash Transfers | Weight-for-age | | | z-score
Conditional Cash Transfers | Weight-for-height | | | z-score
Contract Teachers | Test scores | | | standard deviations
Deworming | Attendance rate | | | percentage points
Deworming | Birthweight | | | kg
Deworming | Height | | | cm
Deworming | Height-for-age | | | z-score
Deworming | Hemoglobin | | | g/dl
Deworming | Malformations | | | percentage points
Deworming | Mid-upper arm circumference | | | cm
Deworming | Test scores | | | standard deviations
Deworming | Weight | | | kg
Deworming | Weight-for-age | | | z-score
Deworming | Weight-for-height | | | z-score
Financial Literacy | Has savings | | | percentage points
Financial Literacy | Probability has taken loan | | | percentage points
Financial Literacy | Savings | | | current US$
HIV/AIDS Education | Pregnancy rate | | | percentage points
HIV/AIDS Education | Probability has multiple sex partners | | | percentage points
HIV/AIDS Education | Probability sexually active | | | percentage points
HIV/AIDS Education | STD prevalence | | | percentage points
HIV/AIDS Education | Used contraceptives | | | percentage points
Irrigation | Consumption | | | current US$
Irrigation | Total income | | | current US$
Micro Health Insurance | Household health expenditures | | | current US$
Micro Health Insurance | Probability of inpatient visit | | | percentage points
Micro Health Insurance | Probability of outpatient visit | | | percentage points
Microfinance | Assets | | | current US$
Microfinance | Consumption | | | current US$
Microfinance | Probability of owning business | | | percentage points
Microfinance | Profits | | | current US$
Microfinance | Savings | | | current US$
Microfinance | Total income | | | current US$
Micronutrients | Height | | | cm
Micronutrients | Height-for-age | | | z-score
Micronutrients | Hemoglobin | | | g/dl
Micronutrients | Mid-upper arm circumference | | | cm
Micronutrients | Test scores | | | standard deviations
Micronutrients | Weight | | | kg
Micronutrients | Weight-for-age | | | z-score
Micronutrients | Weight-for-height | | | z-score
Performance Pay | Test scores | | | standard deviations
Rural Electrification | Enrollment rate | | | percentage points
Rural Electrification | Study time | | | hours/week
Rural Electrification | Total income | | | current US$
Scholarships | Attendance rate | | | percentage points
Scholarships | Enrollment rate | | | percentage points
Scholarships | Test scores | | | standard deviations
School Meals | Enrollment rate | | | percentage points
School Meals | Height-for-age | | | z-score
School Meals | Test scores | | | standard deviations
Unconditional Cash Transfers | Enrollment rate | | | percentage points
Unconditional Cash Transfers | Test scores | | | standard deviations
Unconditional Cash Transfers | Weight-for-height | | | z-score
Women's Empowerment | Savings | | | current US$
Women's Empowerment | Total income | | | current US$
Table 5: Regression of Effect Size on Study Characteristics
Columns (1) to (5); dependent variable in each column: Effect size; cells report b/se.
Number of observations (100,000s): *** *** ** (0.01) (0.00) (0.01)
Government-implemented: *** ** (0.06) (0.06)
Academic/NGO-implemented: (0.04) (0.05)
RCT: (0.04)
East Asia: (0.03)
Latin America: (0.04)
Middle East/North Africa: 0.284** (0.11)
South Asia: (0.04)
Constant: 0.120*** 0.199*** 0.080*** 0.114*** 0.201*** (0.00) (0.04) (0.03) (0.02) (0.04)
Observations:
R²:

Table 6: PRESS statistics and $R^2_{Pr}$: All interventions
Columns: C dummies | PRESS | PRESS_Sim | p-value | $R^2_{Pr}$ | $R^2_{Pr,Sim}$ | $R^2_{Pr} - R^2_{Pr,Sim}$
Intervention
Outcome (p-value: <)
Intervention & Outcome (p-value: <)
Region
Implementer

One may be concerned that low-quality papers are either inflating or depressing the degree of generalizability that is observed. There are infinitely many ways to measure paper quality; I consider two. First, I consider only those papers that were randomized controlled trials. Second, I use the most widely used quality assessment measure, the Jadad scale (Jadad et al., 1996). The Jadad scale asks whether the study was randomized, double-blind, and whether there was a description of withdrawals and dropouts. A paper gets one point for having each of these characteristics; in addition, a point is added if the method of randomization was appropriate, subtracted if the method is inappropriate, and similarly added if the blinding method was appropriate and subtracted if inappropriate. This results in a 0-5 point scale. Given that the kinds of interventions being tested are not typically readily suited to blinding, I consider all those papers scoring at least a 3 to be high quality.

Table 7: Within-intervention PRESS statistics and $R^2_{Pr}$
Columns: C dummies | PRESS | PRESS_Sim | p-value | $R^2_{Pr}$ | $R^2_{Pr,Sim}$ | $R^2_{Pr} - R^2_{Pr,Sim}$
CCTs: Outcome; Region; Implementer
Deworming: Outcome; Region; Implementer

Table 8: PRESS statistics and $R^2_{Pr}$ using Within-Intervention-Outcome Meta-Analysis Results to Predict Estimates from Different Implementers
Columns: Meta-Analysis Result | PRESS | PRESS_Sim | p-value | $R^2_{Pr}$ | $R^2_{Pr,Sim}$ | $R^2_{Pr} - R^2_{Pr,Sim}$
M (on full sample) (p-value: <)
M (on govt only)
M (on acad/NGO only) (p-value: <)

Tables 13 and 14 in the Appendix provide robustness checks using these two quality measures. Table 15 also considers those intervention-outcome combinations with at least four papers; Table 16 includes the one observation previously dropped for having an effect size more than 2 SD away from 0; in Table 17, I use a fixed effect meta-analysis to create the leave-one-out meta-analysis results.

Table 9: Regression of Effect Size on Hierarchical Bayesian Meta-Analysis Results
 | (1) Effect size | (2) Effect size | (3) Effect size
Meta-analysis result (b) | 0.595*** | 0.530*** | 0.709***
(se) | (0.09) | (0.05) | (0.14)
Observations | | |
R² | | |
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination; that one observation left out is on the left hand side in the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results on the full data set; column (2) shows the results for those effect sizes pertaining to programs implemented by the government; column (3) shows the results for academic/NGO-implemented programs.

To illustrate the mechanics more clearly, Figures 4 and 5 show the density of results according to a hierarchical Bayesian model with an uninformative prior (see footnote 10). Figure 4 shows the results within one intervention-outcome combination: conditional cash transfers and enrollment rates. The dark dots correspond to the aggregated estimates of the government versions of the interventions; the light, the academic/NGO versions. The still lighter dots found in figures in the Appendix represent those papers with other implementers (collaborations or private sector implementers). The dashed black line shows the overall weighted mean. The two panels on the right side of Figure 4 give the weighted distribution of effects according to a hierarchical Bayesian model. In the first panel, all intervention-implementer pairs are pooled; in the second, they are disaggregated. This figure graphically depicts what a meta-analysis does. We can make a similar figure for each of the 48 intervention-outcome combinations on which we have sufficient papers (available online; for details, see Appendix A).

10 Code adapted from Hsiang, Burke and Miguel (2013), based on Gelman et al. (2013).

To summarize results further, we may want to aggregate up to the intervention-implementer level. Figure 5 shows the effects if the meta-analysis were to aggregate across outcomes within an intervention-implementer as if they were independent. This figure is based on only those interventions that have been attempted by both a government agency and an academic team or NGO. The standard errors here are exceptionally small for a reason: each dot has aggregated all the results of all the different papers and outcomes under that intervention-implementer.
Every time one aggregates in this fashion, even if one assumes that the multiple observations are correlated and corrects for this, the standard errors will decrease. The standard errors are thus more indicative of how much data I have on those particular programs (not much, for example, in the case of academic/NGO-implemented unconditional cash transfers). Figure 5 is simply for illustrative purposes, as collapsing across outcomes may not make sense if the outcomes are not comparable. Still, unless one expects a systematic bias affecting government-implemented programs differently vis-à-vis academic/NGO-implemented programs, it is clear that the distribution of the programs' effects looks quite different, and the academic/NGO-implemented programs routinely exhibit higher effect sizes than their government-implemented counterparts.

Figure 5 also illustrates the limitations of meta-analyses: while the academic/NGO-implemented studies do better overall, the higher weighting of some of the better government-implemented interventions hides this in the results in the side panel. This again points to the fact that what one is aggregating over and the weighting scheme used are important.

Figure 4: Example of Hierarchical Bayesian Meta-Analysis Results: Conditional Cash Transfers and Enrollment Rates
[Figure: horizontal axis, effect size (in SD). Studies shown: Barrera-Osorio et al. (2008); Glewwe and Kassouf (2008); Rubio-Codina (2003); Gitter and Barham (2008); Fuwa (2001); Ferro, Kassouf and Levison (2007); Attanasio et al. (2010); Ward et al. (2010); Olinto and Souza (2005); Garcia and Hill (2009); Dubois, de Janvry and Sadoulet (2012); Ham (2010); Perova (2010); Chaudhury, Friedman and Onishi (2013); Davis et al. (2002); de Janvry, Finan and Sadoulet (2012); de Brauw and Gilligan (2011); Arraiz and Rozo (2011); Galasso (2006); Galiani and McEwan (2013); Angelucci et al. (2010); Maluccio, Murphy and Regalia (2009); Behrman, Parker and Todd (2004); Akresh, de Walque and Kazianga (2013); Edmonds and Schady (2012); Ferreira, Filmer and Schady (2009); Mo et al. (2013); Baird et al. (2011). Side panels: All studies; Academic (light) vs. Government (dark).]
This figure provides the point estimate and confidence interval for each paper's estimated effect of a conditional cash transfer program (intervention) on enrollment rates (outcome). The first grey box on the right hand side of the figure shows the aggregate distribution of results using the hierarchical Bayesian procedure, and the second grey box farther to the right shows the distributions for the government-implemented and academic/NGO-implemented studies separately. Government-implemented programs are denoted by dark grey, while academic/NGO-implemented studies are in lighter grey. The even lighter dots in other figures found in the Appendix represent those papers with other implementers (collaborations or private sector implementers). The data have been converted to standard deviations for later comparison with other outcomes. Conventions were followed to isolate one result per paper. These are detailed in the Appendix and in the online coding manual, but the main criteria were to use the result with the fewest controls; if results for multiple time periods were presented, the time period closest to those examined in other papers in the same intervention-outcome was selected; if results for multiple subgroups were presented, such as different age ranges, results were aggregated as these data were typically too sparse to do subgroup analyses. Thus, the data have already been slightly aggregated in this figure.

Figure 5: Government and Academic/NGO Implemented Projects Differ Within the Same Interventions
[Figure: horizontal axis, effect size (in SD). Interventions shown: Conditional cash transfers; Microfinance; Contract teachers; Micronutrient supplementation; Deworming; Performance pay; Financial literacy; School meals; HIV/AIDS Education. Side panels: All studies; Academic (light) vs. Government (dark).]
This figure focuses only on those interventions for which there were both government-implemented and academic/NGO-implemented studies. All outcomes and papers are aggregated within each intervention-implementer combination. While it appears that the academic/NGO-implemented studies do better overall, the higher weighting of some of the better government-implemented interventions (in particular, conditional cash transfer programs, which tend to have very large sample sizes) disguises this. This figure both illustrates that academic/NGO-implemented programs seem to do better than government-implemented programs and shows why caution must be taken in interpreting results.

5 Specification searching and publication bias

Results on generalizability could be biased in the presence of specification searching and publication bias. In particular, if studies are systematically biased, they could speciously appear to be more generalizable. In this section, I examine these issues. First, I test for specification searching and publication bias, finding that these biases are quite limited in my data, especially among randomized controlled trials. I then suggest a mathematical correction that could be applied, much in
the spirit of Simonsohn et al.'s p-curve (2014), which looks at the distribution of p-values one would expect given a true effect size. I run some simulations that show that my mathematical correction would recover the correct distribution of effect sizes. However, since that approach depends on a key assumption that there is one true effect size and all deviations are noise, and one might want to weaken that assumption, I show how one might do that. Finally, I restrict attention to just the subset of studies that were RCTs, as I show that they do not appear subject to the same biases, and repeat the earlier analyses.

5.1 Specification searching: how bad is it?

Method

To examine the issue of specification searching, I start by conducting a series of caliper tests, following Gerber and Malhotra (2008a). As they describe (2008b), even if results are naturally concentrated in a given range, one should expect to see roughly comparable numbers of results just on either side of any threshold when restricting attention to a narrow enough band. I consider the ranges 2.5%, 5%, 10%, 15% and 20% above and below z=1.96, in turn, and examine whether results follow a binomial distribution around 1.96 as one would expect in the absence of bias. I do these tests on the full data set (here there is no need to consider only those results for intervention-outcome combinations covered by a certain number of papers, for example) but then also break it down in several ways, such as by RCT or non-RCT and government-implemented or non-government-implemented.

When doing this kind of analysis, one should also carefully consider the issues arising from having multiple coefficients coming from the same papers. Gerber and Malhotra (2008a; 2008b) address the issue by breaking down their results by the number of coefficients contributed by each paper, so as to separately show the results for those papers that contribute one coefficient, two coefficients, and so on.
I also do this, but in addition use the common statistical method of aggregating the results by paper, so that, for example, a paper with four coefficients below the threshold and three above it would be counted as "below". The approach followed in Gerber and Malhotra (2008a; 2008b) retains slightly different information. While it preserves the number of coefficients on either side of the threshold, it does not reduce the bias that may be present if one or two of the papers are responsible for much of the effect. By presenting a set of results collapsed by paper, I can test if results are sensitive to this.

Results

I begin by simply plotting the distribution of z-statistics in the data for different groups. The distributions, shown in Figure 6, are consistent with specification searching, particularly for government-implemented programs and non-RCTs: while noise remains, mostly from using multiple results from the same study, which tend to be clustered, there appears to be a bit of a deviation from the downward trend around 1.96, the threshold for statistical significance at the 5% level for a two-sided test. These are de-rounded figures, accounting for the fact that papers may have presented results which were imprecise; for example, specifying a point estimate of 0.03 and a standard error of [...]. Since these results would artificially cause spikes in the distribution, I redrew their z-statistics from the uniform range of possible results ([...]), as in Brodeur et al. (2012).

Figure 6: Distribution of z-statistics
This figure shows histograms of the z-statistics, by implementer and whether the result was from an RCT. A jump around 1.96, the threshold for significance at the 5% level, would suggest that authors were wittingly or unwittingly selecting significant results for inclusion.
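The de-rounding step can be illustrated with a short sketch. It reflects one plausible reading of the procedure, in which the reported coefficient and standard error each lie within half a unit of their last reported decimal and the z-statistic is redrawn uniformly between the smallest and largest values consistent with that rounding. The function names are hypothetical, and the standard error of 0.01 in the example is invented for illustration because the example value is not legible in the text.

```python
import random


def rounding_half_width(reported):
    """Half the unit of the last reported decimal, e.g. '0.03' -> 0.005."""
    decimals = len(reported.split(".")[1]) if "." in reported else 0
    return 0.5 * 10 ** (-decimals)


def z_range(coef, se):
    """Smallest and largest z-statistics consistent with the rounded values.

    coef and se are passed as strings so that the number of reported
    decimals is preserved.
    """
    c, s = float(coef), float(se)
    hc, hs = rounding_half_width(coef), rounding_half_width(se)
    if s - hs <= 0:
        raise ValueError("standard error too coarsely rounded to bound z")
    candidates = [(c + dc) / (s + ds) for dc in (-hc, hc) for ds in (-hs, hs)]
    return min(candidates), max(candidates)


def deround_z(coef, se, rng):
    """Redraw the z-statistic uniformly from its feasible range."""
    low, high = z_range(coef, se)
    return rng.uniform(low, high)


# Reported point estimate 0.03 with a hypothetical standard error of 0.01.
low_z, high_z = z_range("0.03", "0.01")
z_draw = deround_z("0.03", "0.01", random.Random(0))
```

In this example the naive z-statistic would be exactly 3, while the feasible range runs from roughly 1.67 to 7, which shows why coarsely rounded results pile up at a few values unless they are redrawn.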
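Table 10: Caliper Tests: By Result

                  Over Caliper   Under Caliper   p-value
All studies
  2.5% Caliper    [...]          [...]           <0.10
  5% Caliper      [...]          [...]           [...]
  10% Caliper     [...]          [...]           [...]
  15% Caliper     [...]          [...]           [...]
  20% Caliper     [...]          [...]           [...]
RCTs
  2.5% Caliper    [...]          [...]           [...]
  5% Caliper      [...]          [...]           [...]
  10% Caliper     [...]          [...]           [...]
  15% Caliper     [...]          [...]           [...]
  20% Caliper     [...]          [...]           [...]
Non-RCTs
  2.5% Caliper    [...]          [...]           <0.01
  5% Caliper      [...]          [...]           <[...]
  10% Caliper     [...]          [...]           [...]
  15% Caliper     [...]          [...]           [...]
  20% Caliper     [...]          [...]           [...]

Table 11: Caliper Tests: By Paper

                  Over Caliper   Under Caliper   p-value
All studies
  2.5% Caliper    [...]          [...]           [...]
  5% Caliper      [...]          [...]           [...]
  10% Caliper     [...]          [...]           [...]
  15% Caliper     [...]          [...]           [...]
  20% Caliper     [...]          [...]           [...]
RCTs
  2.5% Caliper    [...]          [...]           [...]
  5% Caliper      [...]          [...]           [...]
  10% Caliper     [...]          [...]           <[...]
  15% Caliper     [...]          [...]           [...]
  20% Caliper     [...]          [...]           <0.10
Non-RCTs
  2.5% Caliper    19             7               <0.05
  5% Caliper      [...]          [...]           [...]
  10% Caliper     [...]          [...]           [...]
  15% Caliper     [...]          [...]           [...]
  20% Caliper     [...]          [...]           [...]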
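The counts behind Tables 10 and 11 could be computed along the following lines. This is a hedged sketch rather than the paper's implementation: it assumes the caliper width is a percentage of 1.96, that absolute z-statistics are used, that the binomial test is one-sided with success probability 0.5, and that a paper is classified by the majority of its in-caliper results, with ties and papers without in-caliper results dropped. All function names are illustrative.

```python
from math import comb

THRESHOLD = 1.96  # two-sided 5% significance threshold


def caliper_counts(z_stats, width):
    """Count results just over and just under the threshold within +/- width * 1.96."""
    lo, hi = THRESHOLD * (1 - width), THRESHOLD * (1 + width)
    over = sum(1 for z in z_stats if THRESHOLD <= abs(z) <= hi)
    under = sum(1 for z in z_stats if lo <= abs(z) < THRESHOLD)
    return over, under


def binomial_p_one_sided(over, under):
    """P(X >= over) for X ~ Binomial(over + under, 0.5)."""
    n = over + under
    return sum(comb(n, k) for k in range(over, n + 1)) / 2 ** n


def collapse_by_paper(z_by_paper, width):
    """Classify each paper as over or under by the majority of its in-caliper results."""
    over_papers = under_papers = 0
    for z_stats in z_by_paper.values():
        over, under = caliper_counts(z_stats, width)
        if over > under:
            over_papers += 1
        elif under > over:
            under_papers += 1
        # Ties and papers with no in-caliper results are dropped (an assumption).
    return over_papers, under_papers


# The non-RCT row of Table 11 that survives in full: 19 papers over, 7 under.
p_value = binomial_p_one_sided(19, 7)
```

Under these assumptions the 19-to-7 split gives a one-sided p-value of roughly 0.014, which is in line with the "<0.05" reported for that row.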
Overall, these figures look much better than the typical ones in the literature. I designed AidGrade's coding conventions partially to minimize bias,[11] which could help explain the difference. The government z-statistics are perhaps the most interesting. While they may reflect noise rather than bias, it is intuitive that governments might exert pressure on their evaluators to find significant, positive effects. Suppose there are two ways of obtaining a significant effect size: putting effort into the intervention (such as increasing inputs) or putting effort into leaning on the evaluators. For large-scale projects, it would seem much more efficient to target the evaluator. While this story would be consistent with the observed evidence, many other explanations remain possible.

Turning to the caliper tests, I still find little evidence of bias (Tables 10 and 11). For the caliper tests, I use the raw rather than de-rounded data, as de-rounding could mask subtle jumps at z=1.96. Whether considering results independently or collapsed by paper, non-RCTs appear to suffer from bias, but RCTs perform much better. It should be recalled that, as the distribution of the z-statistics is skewed, we should expect to see fewer results just over as opposed to just under the threshold for significance for a wide enough band, which is indeed what we see for RCTs. The results, especially for RCTs, mark a great difference from Gerber and Malhotra's results on the political science (2008a) or sociology (2008b) literature. I reproduce one of their tables in the Appendix to illustrate (Table 19), as the difference between my results and their row of mostly p < [...] results is striking.

[11] Such as by focusing on the specifications with fewest controls and only collecting subgroup data where results for all subgroups were reported.

5.2 Publication bias

Turning to publication bias, published impact evaluations are more likely to have significant findings than working papers, as we can see in Table 12. RCTs are also greatly selected for publication. However, once one controls for whether a study is an RCT, publication bias is reduced.

Table 12: Publication Bias
Columns (1), (2) and (3); dependent variable in each column: Published; cells report b/se.
RCT: coefficients 4.429***, 2.796***; standard errors (0.85), (1.11)
Significant: coefficients 1.855***, [...]; standard errors (0.27), (0.37)
RCT*Significant: coefficient [...]; standard error (0.59)
Observations: [...]
Exponentiated coefficients. Each column presents the results of a logistic regression of whether a study or result was published on different characteristics. Column (1) considers papers; Columns (2) and (3) consider results for simplicity, though the decision of whether or not a paper will be published could be a more complicated function of the individual results' significance.

As discussed, an attempt was made to be very comprehensive in the data gathering process, and both published and unpublished papers were searched for and included. I can therefore rerun the main regressions separately for published and unpublished papers (Table 18 in the Appendix). Results are fairly similar. The coefficient on the meta-analysis term is marginally not significant for unpublished papers (p=0.12), but this appears likely to be influenced by the small sample size; the coefficient is of a similar magnitude, and if I run the meta-analyses on the full data set and then restrict attention to the subset of unpublished papers (reducing noise), rather than first restricting attention to the unpublished papers and then generating the meta-analysis results, the coefficient is again highly significant (p < 0.001).[12]

In short, published and unpublished papers appear roughly comparable in terms of generalizability; it can therefore be hoped that if any papers were not only unpublished but also file-drawered, i.e. not even available as a working paper, this may not be too much of a concern. Since I cannot say with certainty that file-drawered studies are similar to unpublished studies, or that file-drawered studies are necessarily to unpublished studies what unpublished studies are to published studies, this paper ultimately can only speak to the results of papers that do come out in some form.

Finally, we might believe that earlier or later papers on the same intervention-outcome combination might show systematically different results. On the one hand, later authors might face pressure to find large or different results from earlier studies in order to publish better.
In other disciplines, one can also sometimes observe a decline effect, whereby early published results were much more promising than later replications, possibly due to publication bias. To investigate this, I generate the absolute value of the percent difference between a particular result and the mean result in that intervention-outcome combination. I then compare this with the chronological order of the paper relative to others on the same intervention-outcome, scaled to run from 0 to 1. For example, if there were 5 papers on a particular intervention-outcome combination, the first would take the value 0.2 and the last the value 1. Figure 7 provides a scatter plot of the relationship between the absolute percent difference and the chronological order variables, restricting attention to those percent differences less than 1000%. There is a weak positive relationship between them, indicating that earlier results tend to be closer to the mean result than the later results, which are more variable, but this is not significant. Further, the relationship varies according to the cutoff used. Table 20 in Appendix C illustrates.

Figure 7: Variance of Results Over Time, Within Intervention-Outcome

One could also look at how results vary by journal rankings, but relatively few papers are published in economics journals that have a ranking assigned to them, as most rankings focus on e.g. the top 500, as in RePEc, so the data would not yet be sufficient; future work can address this issue.

[12] Results are available upon request.
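The two variables plotted in Figure 7 can be constructed as in the sketch below. It is illustrative only: it assumes one result per paper, that papers are ordered by a year field with ties broken arbitrarily, and that the group mean is nonzero. The function name and the example data are hypothetical.

```python
from statistics import mean


def chronological_deviation(results):
    """results: (year, effect_size) pairs for one intervention-outcome combination.

    Returns (scaled_order, abs_percent_difference) pairs, where scaled_order
    runs from 1/n for the earliest paper to 1 for the latest.
    """
    ordered = sorted(results, key=lambda r: r[0])
    n = len(ordered)
    group_mean = mean(effect for _, effect in ordered)
    rows = []
    for rank, (_, effect) in enumerate(ordered, start=1):
        pct_diff = abs((effect - group_mean) / group_mean) * 100
        rows.append((rank / n, pct_diff))
    return rows


# Hypothetical data: five papers on one intervention-outcome combination.
example = [(2005, 0.1), (2003, 0.2), (2010, 0.3), (2008, 0.2), (2012, 0.2)]
rows = chronological_deviation(example)
# Apply the 1000% cutoff used for the scatter plot.
plotted = [(order, diff) for order, diff in rows if diff < 1000]
```

With five papers the scaled order takes the values 0.2, 0.4, 0.6, 0.8 and 1, matching the example in the text.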
5.3 Identifying bias vs. heterogeneous effects

Generalizability, specification searching and publication bias are intimately related. This can be seen by considering how one might correct the bias among the papers in a single intervention-outcome combination. If one were to consider these papers to be estimating the same true effect, as in the fixed effects model,[13] one would immediately know the variance of the effect sizes. In a fixed effects model, it is always the same: Y_i is normally distributed, with variance equal to 2/n. This can be seen by considering that the only variation in Y_i is assumed to be due to sampling variance, σ²/n for each sample, and since Y_i is standardized, σ² = 1, resulting in 2/n for two samples. This implies that if one believed the fixed effects model were true, one would not even need to correct the effect size estimates Y_i in order to know their variance. For some of the measures of generalizability previously discussed, such as a simple PRESS statistic which tried to see how well the group mean effect size, excepting i, predicted Y_i, we can pin down the correct statistic using only the sample size.[14] This serves to underscore the inappropriateness of the fixed effects model when considering generalizability.

Bias and generalizability are not separately identified by themselves. If one brought data from a random effects model and tried to fit it to a fixed effects model, the real differences in effect size would appear as bias.

There are a few possible solutions. A random effects model would seem to be better, as it allows the possibility of heterogeneous treatment effects. However, one would still have to be confident that one could distinguish bias from heterogeneous effects; one needs another lever. One possibility would be to exploit different levels of the data.
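The 2/n variance stated for the fixed effects model can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper's analysis: each study has a treatment arm and a control arm of equal size n, outcomes are normal with known unit variance (so the mean difference is already standardized), and every study shares one true effect. The function name and parameter values are hypothetical.

```python
import random
from statistics import fmean, pvariance


def simulate_effect_sizes(true_effect, n, reps, seed=0):
    """Simulate standardized mean differences from `reps` two-arm studies."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        treat = fmean([rng.gauss(true_effect, 1.0) for _ in range(n)])
        control = fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        estimates.append(treat - control)
    return estimates


n = 50
estimates = simulate_effect_sizes(true_effect=0.3, n=n, reps=4000)
simulated_variance = pvariance(estimates)  # should be close to 2 / n
theoretical_variance = 2 / n
```

Each arm mean has variance 1/n, and the two arms are independent, so the difference has variance 1/n + 1/n = 2/n regardless of the true effect.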
As an example of exploiting different levels of the data, if one looked within very narrow groups of results that could plausibly share a true effect size, one could reasonably consider each of those groups of results as fitting a separate fixed effects model and correct them before aggregating up in a hierarchical model. Concretely, if the goal were to have an unbiased random effects model within a particular intervention-outcome combination, one could correct the data at the paper level, supposing there were enough results within a paper on the same outcome. I do not take this approach because, apart from being data-intensive, it requires having homogeneous subgroups and, as previously shown, effect size estimates vary quite a bit even within the same intervention-outcome-paper in the data. Still, for readers who want to take this approach, I include simulations in Appendix D that show that one can indeed recover the original distribution of effect sizes to a close approximation.

Instead, I turn to a different robustness check. Since I previously found that the biases are quite small overall and that RCTs exhibit even fewer signs of specification searching and publication bias, I rerun the main regression relating to generalizability on the subset of the data that were RCTs (Table 13, Appendix C).

[13] Used in e.g. Simonsohn et al. (2014).
[14] Other measures, such as the coefficient of variation, would also require adjusting θ, but this could also be easily obtained as in Simonsohn et al. (2014).

6 Conclusion

How much impact evaluation results generalize to other settings is an important topic, and data from meta-analyses are the ideal data with which to answer this question. With data on 20 different types of interventions, all collected in the same way, we can begin to speak a bit more generally about how results tend to vary across contexts and what that implies for impact evaluation design and policy recommendations.

I started by defining generalizability and relating it to heterogeneous treatment effects models. After examining key summary statistics, I conducted leave-one-out hierarchical Bayesian meta-analyses within different narrowly-defined intervention-outcome combinations, separating the meta-analyses into different specifications by type of implementer. The results of these meta-analyses were significantly associated with the effect size left out at each iteration, with the typical coefficient on the meta-analysis result being approximately [...]; it would be 1 if the estimate were identical to the meta-analysis result. However, the meta-analysis results were not significantly associated with the results of studies implemented by governments.
Further, the effect sizes of government-implemented programs appeared to be lower even after controlling for sample size. This points to a potential problem when results from an academic/NGO-implemented study are expected to scale through government implementation.

Specification searching and publication bias could affect both impact evaluation results and their generalizability, and I next turned to examine these topics. While each issue is present, neither turns out to be particularly important in the data. RCTs fare better than non-RCTs.

Overall, impact evaluation results were very heterogeneous. Even within the same paper, where one might expect contexts to be very similar, there was a high degree of variation within the same intervention-outcome combination. When considering the coefficient of variation, a unitless figure that can be compared across outcomes, within-paper variation was 67% of across-paper variation. Both, however, were quite high, at 1.2 and 1.9, respectively; in comparison, results in the medical literature might have coefficients of variation of about [...]. The average across-paper, within-intervention-outcome coefficient of variation for the set of papers covering the same program was 1.5.

There are some steps that researchers can take that may improve the generalizability of their own studies. First, just as with heterogeneous selection into treatment (Chassang, Padró i Miquel and Snowberg, 2012), one solution would be to ensure one's impact evaluation varied some of the contextual variables that we might think underlie the heterogeneous treatment effects. Given that many studies are underpowered as it is, that may not be likely; however, large organizations and governments have been supporting more impact evaluations, providing more opportunities to explicitly integrate these analyses. Efforts to coordinate across different studies, asking the same questions or looking at some of the same outcome variables, would also help. The framing of heterogeneous treatment effects could also provide positive motivation for replication projects in different contexts: different findings would not necessarily negate the earlier ones but add another level of information.

In summary, generalizability is not binary but something that we can measure. This paper showed that past results have significant but limited ability to predict other results on the same topic, and that this was not seemingly due to bias.
Knowing how much results tend to extrapolate, and when, is critical if we are to know how to interpret an impact evaluation's results or apply its findings.
References

AidGrade (2013). AidGrade Process Description, processmapandmethodology, March 9.
AidGrade (2015). AidGrade Impact Evaluation Data, Version 1.1.
Alesina, Alberto and David Dollar (2000). Who Gives Foreign Aid to Whom and Why?, Journal of Economic Growth, vol. 5 (1).
Allcott, Hunt (forthcoming). Site Selection Bias in Program Evaluation, Quarterly Journal of Economics.
Bastardi, Anthony, Eric Luis Uhlmann and Lee Ross (2011). Wishful Thinking: Belief, Desire, and the Motivated Evaluation of Scientific Evidence, Psychological Science.
Becker, Betsy Jane and Meng-Jia Wu (2007). The Synthesis of Regression Slopes in Meta-Analysis, Statistical Science, vol. 22 (3).
Bold, Tessa et al. (2013). Scaling-up What Works: Experimental Evidence on External Validity in Kenyan Education, working paper.
Borenstein, Michael et al. (2009). Introduction to Meta-Analysis. Wiley Publishers.
Boriah, Shyam et al. (2008). Similarity Measures for Categorical Data: A Comparative Evaluation, in Proceedings of the Eighth SIAM International Conference on Data Mining.
Brodeur, Abel et al. (2012). Star Wars: The Empirics Strike Back, working paper.
Cartwright, Nancy (2007). Hunting Causes and Using Them: Approaches in Philosophy and Economics. Cambridge: Cambridge University Press.
Cartwright, Nancy (2010). What Are Randomized Controlled Trials Good For?, Philosophical Studies, vol. 147 (1).
Casey, Katherine, Rachel Glennerster, and Edward Miguel (2012). Reshaping Institutions: Evidence on Aid Impacts Using a Pre-analysis Plan, Quarterly Journal of Economics, vol. 127 (4).
Chassang, Sylvain, Gerard Padró i Miquel, and Erik Snowberg (2012). Selective Trials: A Principal-Agent Approach to Randomized Controlled Experiments, American Economic Review, vol. 102 (4).
Deaton, Angus (2010). Instruments, Randomization, and Learning about Development, Journal of Economic Literature, vol. 48 (2).
Duflo, Esther, Pascaline Dupas and Michael Kremer (2012). School Governance, Teacher Incentives and Pupil-Teacher Ratios: Experimental Evidence from Kenyan Primary Schools, NBER Working Paper.
Evans, David and Anna Popova (2014). Cost-effectiveness Measurement in Development: Accounting for Local Costs and Noisy Impacts, World Bank Policy Research Working Paper.
Ferguson, Christopher and Michael Brannick (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses, Psychological Methods, vol. 17 (1), Mar 2012.
Franco, Annie, Neil Malhotra and Gabor Simonovits (2014). Publication Bias in the Social Sciences: Unlocking the File Drawer, Working Paper.
Gerber, Alan and Neil Malhotra (2008a). Do Statistical Reporting Standards Affect What Is Published? Publication Bias in Two Leading Political Science Journals, Quarterly Journal of Political Science, vol. 3.
Gerber, Alan and Neil Malhotra (2008b). Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results?, Sociological Methods & Research, vol. 37 (3).
Gelman, Andrew et al. (2013). Bayesian Data Analysis, Third Edition, Chapman and Hall/CRC.
Hedges, Larry and Therese Pigott (2004). The Power of Statistical Tests for Moderators in Meta-Analysis, Psychological Methods, vol. 9 (4).
Higgins, JPT and S Green (eds.) (2011). Cochrane Handbook for Systematic Reviews of Interventions, Version [updated March 2011]. The Cochrane Collaboration.
Hsiang, Solomon, Marshall Burke and Edward Miguel (2013). Quantifying the Influence of Climate on Human Conflict, Science.
Independent Evaluation Group (2012). World Bank Group Impact Evaluations: Relevance and Effectiveness, World Bank Group.
Jadad, A.R. et al. (1996). Assessing the quality of reports of randomized clinical trials: Is blinding necessary?, Controlled Clinical Trials, 17 (1).
Millennium Challenge Corporation (2009). Key Elements of Evaluation at MCC, presentation, June 9.
Page, Matthew, McKenzie, Joanne and Andrew Forbes (2013). Many Scenarios Exist for Selective Inclusion and Reporting of Results in Randomized Trials and Systematic Reviews, Journal of Clinical Epidemiology, vol. 66 (5).
Pritchett, Lant and Justin Sandefur (2013). Context Matters for Size: Why External Validity Claims and Development Practice Don't Mix, Center for Global Development Working Paper 336.
Rodrik, Dani (2009). The New Development Economics: We Shall Experiment, but How Shall We Learn?, in What Works in Development? Thinking Big, and Thinking Small, ed. Jessica Cohen and William Easterly, Washington, D.C.: Brookings Institution Press.
Saavedra, Juan and Sandra Garcia (2013). Educational Impacts and Cost-Effectiveness of Conditional Cash Transfer Programs in Developing Countries: A Meta-Analysis, CESR Working Paper.
Simmons, Joseph and Uri Simonsohn (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psychological Science, vol. 22.
Simonsohn, Uri et al. (2014). P-Curve: A Key to the File Drawer, Journal of Experimental Psychology: General.
Tibshirani, Ryan and Robert Tibshirani (2009). A Bias Correction for the Minimum Error Rate in Cross-Validation, Annals of Applied Statistics, vol. 3 (2).
Tierney, Michael J. et al. (2011). More Dollars than Sense: Refining Our Knowledge of Development Finance Using AidData, World Development, vol. 39.
Tipton, Elizabeth (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts, Journal of Educational and Behavioral Statistics, 38.
RePEc (2013). RePEc h-index for journals, top.journals.hindex.html.
Vivalt, Eva (2015). Selection Bias in Impact Evaluations: Evidence from the World Bank, Working Paper.
Walsh, Michael et al. (2013). The Statistical Significance of Randomized Controlled Trial Results is Frequently Fragile: A Case for a Fragility Index, Journal of Clinical Epidemiology.
USAID (2011). Evaluation: Learning from Experience, USAID Evaluation Policy, Washington, DC.
Appendices

A Guide to Appendices

A.1 Appendices in this Paper

B) Excerpt from AidGrade's Process Description (2013).
C) Additional results.
D) Simulations showing the recoverability of the distribution of unbiased data in fixed effect models.

A.2 Further Online Appendices

Having to describe data from twenty different meta-analyses and systematic reviews, I must rely in part on online appendices. The following are available online:

E) The search terms and inclusion criteria for each topic.
F) The references for each topic.
G) The coding manual.
H) Figures showing hierarchical Bayesian meta-analysis results for each intervention-outcome combination.
B Data Collection

B.1 Description of AidGrade's Methodology

The following details of AidGrade's data collection process draw heavily from AidGrade's Process Description (AidGrade, 2013).

Figure 8: Process Description

Stage 1: Topic Identification

AidGrade staff members were asked to each independently make a list of at least thirty international development programs that they considered to be the most interesting. The independent lists were appended into one document and duplicates were tagged and removed. Each of the remaining topics was discussed and refined to bring them all to a clear and narrow level of focus. Pilot searches were conducted to get a sense of how many impact evaluations there might be on each topic, and all the interventions for which the very basic pilot searches identified at least two impact evaluations were shortlisted. A random subset of the topics was selected, also acceding to a public vote for the most popular topic.

Stage 2: Search

Each search engine has its own peculiarities. In order to ensure that all relevant papers and few irrelevant papers were included, a set of simple searches was conducted on different potential search engines. First, initial searches were run on AgEcon; British Library for Development Studies (BLDS); EBSCO; Econlit; Econpapers; Google Scholar; IDEAS; JOLISPlus; JSTOR; Oxford Scholarship Online; Proquest; PubMed; ScienceDirect; SciVerse; SpringerLink; Social Science Research Network (SSRN); Wiley Online Library; and the World Bank eLibrary. The list of potential search engines was compiled broadly from those listed in other systematic reviews. The purpose of these initial searches was to obtain information about the scope and usability of the search engines, to determine which ones would be effective tools in identifying impact evaluations on different topics. External reviews of different search engines were also consulted, such as a Falagas et al. (2008) study which covered the advantages and differences between the Google Scholar, Scopus, Web of Science and PubMed search engines.

Second, searches were conducted for impact evaluations of two test topics: deworming and toilets. EBSCO, IDEAS, Google Scholar, JOLISPlus, JSTOR, Proquest, PubMed, ScienceDirect, SciVerse, SpringerLink, Wiley Online Library and the World Bank eLibrary were used for these searches. Nine search strings were tried for deworming and up to 33 strings for toilets, with modifications as needed for each search engine.
For each search, the number of results and the number of results out of the first [...] results which appeared to be impact evaluations of the topic in question were recorded. This gave a better sense of which search engines and which kinds of search strings would return both comprehensive and relevant results. A qualitative assessment of the search results was also provided for the Google Scholar and SciVerse searches.

Finally, the online databases of J-PAL, IPA, CEGA and 3ie were searched. Since these databases are already narrowly focused on impact evaluations, attention was restricted to simple keyword searches, checking whether the search engines that were integrated with each database seemed to pull up relevant results for each topic.

Ultimately, Google Scholar and the online databases of J-PAL, IPA, CEGA and 3ie, along with EBSCO/PubMed for health-related interventions, were selected for use in the full searches.

After the interventions of interest were identified, search strings were developed and tested using each search source. Each search string included methodology-specific stock keywords that narrowed the search to impact evaluation studies, except for the search strings for the J-PAL, IPA, CEGA and 3ie searches, as these databases already exclusively focus on impact evaluations. Experimentation with keyword combinations in stages 1.4 and 2.1 was helpful in the development of the search strings. The search strings could take slightly different forms for different search engines. Search terms were tailored to the search source, and a full list is included in an appendix.

C# was used to write a script to scrape the results from search engines. The script was programmed to ensure that the Boolean logic of the search string was properly applied within the constraints of each search engine's capabilities.

Some sources were specialized and could have useful papers that do not turn up in simple searches. The papers listed on the J-PAL, IPA, CEGA and 3ie websites are a good example of this. For these sites, it made more sense for the papers to be manually searched and added to the relevant spreadsheets.

After the automated and manual searches were complete, duplicates were removed by matching on author and title names.

During the title screening stage, the consolidated list of citations yielded by the scraped searches was checked for any existing meta-analyses or systematic reviews. Any papers that these meta-analyses or reviews included were added to the list. With these references added, duplicates were again flagged and removed.
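Removing duplicates by matching on author and title names can be sketched as follows. This is a hypothetical Python illustration, not AidGrade's script (which the text says was written in C#): the normalization rules (lowercasing, stripping accents and punctuation, collapsing whitespace) and the record layout with "authors" and "title" keys are assumptions made for the example.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def deduplicate(citations):
    """Keep the first citation seen for each normalized (authors, title) pair."""
    seen = set()
    unique = []
    for citation in citations:
        key = (normalize(citation["authors"]), normalize(citation["title"]))
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique


# Hypothetical records, as might come from two different search sources.
records = [
    {"authors": "Chassang, Padró i Miquel, Snowberg",
     "title": "Selective Trials: A Principal-Agent Approach"},
    {"authors": "Chassang, Padro I Miquel, Snowberg",
     "title": "Selective trials - a principal agent approach"},
    {"authors": "Deaton", "title": "Instruments, Randomization, and Learning"},
]
unique_records = deduplicate(records)
```

Exact matching on normalized strings is deliberately conservative; fuzzier matching would catch more duplicates at the cost of occasionally merging distinct papers.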
Stage 3: Screening

Generic and topic-specific screening criteria were developed. The generic screening criteria are detailed below, as is an example of a set of topic-specific screening criteria.

The screening criteria were very inclusive overall. This is because AidGrade purposely follows a different approach from most meta-analyses, in the hope that the data collected can be reused by researchers who want to focus on a different subset of papers. The motivation is that vast resources are typically devoted to a meta-analysis, but if another team of researchers thinks a different set of papers should be used, they will have to scour the literature and recreate the data from scratch. If the two groups disagree, all the public sees are their two sets of findings and their reasoning for selecting different papers. AidGrade instead strives to cover the superset of all impact evaluations one might wish to include, along with a list of their characteristics (e.g. where they were conducted, whether they were randomized by individual or by cluster, etc.), and to let people set their own filters on the papers or select individual papers and view the entire space of possible results.
Figure 11 illustrates the difference. For this reason, minimal screening was done during the screening stage. Instead, data was collected broadly and re-screening was allowed at the point of doing the analysis. This is highly beneficial for the purpose of this paper, as it allows us to look at the largest possible set of papers and all subsets.

Figure 9: Generic Screening Criteria

Methodologies. Inclusion criteria: impact evaluations that have counterfactuals. Exclusion criteria: observational studies, strictly qualitative studies.
Publication status. Inclusion criteria: peer-reviewed or working paper. Exclusion criteria: N/A.
Time period of study. Inclusion criteria: any. Exclusion criteria: N/A.
Location/Geography. Inclusion criteria: any. Exclusion criteria: N/A.
Quality. Inclusion criteria: any. Exclusion criteria: N/A.

Figure 10: Topic-Specific Criteria Example: Formal Banking

Intervention. Inclusion criteria: formal banking services, specifically including:
- Expansion of credit and/or savings
- Provision of technological innovations
- Introduction or expansion of financial education, or other program to increase financial literacy or awareness
Exclusion criteria: other formal banking services; microfinance.

Outcomes. Inclusion criteria:
- Individual and household income
- Small and micro-business income
- Household and business assets
- Household consumption
- Small and micro-business investment
- Small, micro-business or agricultural output
- Measures of poverty
- Measures of well-being or stress
- Business ownership
- Any other outcome covered by multiple papers
Exclusion criteria: N/A.

After screening criteria were developed, two volunteers independently screened the titles to determine which papers in the spreadsheet were likely to meet the screening criteria developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All volunteers received training before beginning, based on the AidGrade Training Manual and a test set of entries. Volunteers' training inputs were screened to ensure that only proficient volunteers would be allowed to continue.
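The double-screening rule with third-party arbitration can be expressed compactly. The sketch below is only an illustration of the logic described above; the data layout (dictionaries keyed by a paper identifier) and the function names are hypothetical, and the arbiter is modeled as a callable consulted only when the two screeners disagree.

```python
def reconcile(screener_a, screener_b, arbiter):
    """Combine two independent include/exclude decisions per paper.

    screener_a, screener_b: dicts mapping paper id -> bool (True = include).
    arbiter: callable taking a paper id and returning the third screener's decision.
    Returns the final decisions and the list of disputed paper ids.
    """
    final, disputed = {}, []
    for paper_id in screener_a:
        a, b = screener_a[paper_id], screener_b[paper_id]
        if a == b:
            final[paper_id] = a
        else:
            disputed.append(paper_id)
            final[paper_id] = arbiter(paper_id)
    return final, disputed


# Hypothetical decisions on three titles.
first = {"p1": True, "p2": False, "p3": True}
second = {"p1": True, "p2": True, "p3": False}
final, disputed = reconcile(first, second, arbiter=lambda pid: pid == "p2")
```

Recording the disputed identifiers separately makes it possible to track how often the two screeners disagree.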
Figure 11: AidGrade's Strategy
Of those papers that passed the title screening, two volunteers independently determined whether the papers in the spreadsheet met the screening criteria developed in Stage 3.1, judging by the paper abstracts. Any differences in coding were again arbitrated by a third volunteer. The full text was then found for those papers which passed both the title and abstract checks. Any paper that proved not to be a relevant impact evaluation using the aforementioned criteria was discarded at this stage.

Stage 4: Coding

Two AidGrade members each independently used the data extraction form developed in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any disputes were arbitrated by a third AidGrade member. These AidGrade members received much more training than those who screened the papers, reflecting the increased difficulty of their work, and also did a test set of entries before being allowed to proceed.

The data extraction form was organized into three sections: (1) general identifying information; (2) paper and study characteristics; and (3) results. Each section contained qualitative and quantitative variables that captured the characteristics and results of the study.

Stage 5: Analysis

A researcher was assigned to each meta-analysis topic who could specialize in determining which of the interventions and results were similar enough to be combined. If in doubt, researchers could consult the original papers. In general, researchers were encouraged to focus on all the outcome variables for which multiple papers had results.

When a study had multiple treatment arms sharing the same control, researchers would check whether enough data was provided in the original paper to allow estimates to be combined before the meta-analysis was run. This is a best practice to avoid double-counting the control group; for details, see the Cochrane Handbook for Systematic Reviews of Interventions (2011).
If a paper did not provide sufficient data for this, the researcher would make the decision as to which treatment arm to focus on. Data were then standardized within each topic to be more comparable before analysis (for example, units were converted).

The subsequent steps of the meta-analysis process are irrelevant for the purposes of this paper.
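Combining two treatment arms that share a control group can be illustrated with the standard identity for pooling group sizes, means and standard deviations. This is a minimal sketch under the assumption that the paper reports those three quantities for each arm; it is not a description of the exact procedure AidGrade researchers followed, and the function name and inputs are hypothetical.

```python
from math import sqrt


def combine_arms(n1, mean1, sd1, n2, mean2, sd2):
    """Pool two arms into one group with the same overall mean and sample SD."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Total sum of squares = within-arm sums of squares + between-arm term.
    variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
                + n1 * n2 / n * (mean1 - mean2) ** 2) / (n - 1)
    return n, mean, sqrt(variance)


# Hypothetical arms: 3 observations with mean 2 and SD 1; 4 with mean 7 and SD 2.58.
combined_n, combined_mean, combined_sd = combine_arms(3, 2.0, 1.0, 4, 7.0, 2.58)
```

The combined group can then be compared with the shared control once, so the control observations enter the meta-analysis a single time.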
It should be noted that the first set of ten topics followed a slightly different procedure for stages (1) and (2). Only one list of potential topics was created in Stage 1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3), as all searches were manually conducted using specific strings. A different search engine was also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed Central, ArXiv.org, and many other databases of articles, books and presentations. The search strings for both rounds of meta-analysis, manual and scripted, are detailed in another appendix.
C Additional Results
Table 12: Descriptive Statistics: Standardized Narrowly Defined Outcomes

Columns: Intervention; Outcome; # Neg sig papers; # Insig papers; # Pos sig papers; # Papers

Conditional cash transfers; Attendance rate
Conditional cash transfers; Enrollment rate
Conditional cash transfers; Height
Conditional cash transfers; Height-for-age
Conditional cash transfers; Labor force participation
Conditional cash transfers; Probability unpaid work
Conditional cash transfers; Test scores
Conditional cash transfers; Unpaid labor
Conditional cash transfers; Weight-for-age
Conditional cash transfers; Weight-for-height
HIV/AIDS Education; Pregnancy rate
HIV/AIDS Education; Probability has multiple sex partners
HIV/AIDS Education; Used contraceptives
Unconditional cash transfers; Enrollment rate
Unconditional cash transfers; Test scores
Unconditional cash transfers; Weight-for-height
Insecticide-treated bed nets; Malaria
Contract teachers; Test scores
Deworming; Attendance rate
Deworming; Birthweight
Deworming; Diarrhea incidence
Deworming; Height
Deworming; Height-for-age
Deworming; Hemoglobin
Deworming; Malformations
Deworming; Mid-upper arm circumference
Deworming; Test scores
Deworming; Weight
Deworming; Weight-for-age
Deworming; Weight-for-height
Financial literacy; Savings
Improved stoves; Chest pain
Improved stoves; Cough
Improved stoves; Difficulty breathing
Improved stoves; Excessive nasal secretion
Irrigation; Consumption
Irrigation; Total income
Microfinance; Assets
Microfinance; Consumption
Microfinance; Profits
Microfinance; Savings
Microfinance; Total income
Micro health insurance; Enrollment rate
Micronutrient supplementation; Birthweight
Micronutrient supplementation; Body mass index
Micronutrient supplementation; Cough prevalence
Micronutrient supplementation; Diarrhea incidence
Micronutrient supplementation; Diarrhea prevalence
Micronutrient supplementation; Fever incidence
Micronutrient supplementation; Fever prevalence
Micronutrient supplementation; Height
Micronutrient supplementation; Height-for-age
Micronutrient supplementation; Hemoglobin
Micronutrient supplementation; Malaria
Micronutrient supplementation; Mid-upper arm circumference
Micronutrient supplementation; Mortality rate
Micronutrient supplementation; Perinatal deaths
Micronutrient supplementation; Prevalence of anemia
Micronutrient supplementation; Stillbirths
Micronutrient supplementation; Stunted
Micronutrient supplementation; Test scores
Micronutrient supplementation; Triceps skinfold measurement
Micronutrient supplementation; Wasted
Micronutrient supplementation; Weight
Micronutrient supplementation; Weight-for-age
Micronutrient supplementation; Weight-for-height
Mobile phone-based reminders; Appointment attendance rate
Mobile phone-based reminders; Treatment adherence
Performance pay; Test scores
Rural electrification; Enrollment rate
Rural electrification; Study time
Rural electrification; Total income
Safe water storage; Diarrhea incidence
Scholarships; Attendance rate
Scholarships; Enrollment rate
Scholarships; Test scores
School meals; Enrollment rate
School meals; Height-for-age
School meals; Test scores
Water treatment; Diarrhea incidence
Water treatment; Diarrhea prevalence
Women's empowerment programs; Savings
Women's empowerment programs; Total income
Average
Table 13: Regression of Effect Size on Hierarchical Bayesian Meta-Analysis Results: RCTs Only
(1)   (2)   (3)
Effect size   Effect size   Effect size
b/se   b/se   b/se
Meta-analysis result   0.570***   ***
(0.14)   (0.33)   (0.15)
Observations
R²
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination; the one observation left out is on the left-hand side of the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results on the full set of RCTs; column (2) shows the results for those effect sizes pertaining to programs implemented by the government; column (3) shows the results for academic/NGO-implemented programs.
Table 14: Regression of Effect Size on Hierarchical Bayesian Meta-Analysis Results: Minimum Quality Score
(1)   (2)   (3)
Effect size   Effect size   Effect size
b/se   b/se   b/se
Meta-analysis result   0.534***   ***
(0.13)   (0.38)   (0.16)
Observations
R²
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination; the one observation left out is on the left-hand side of the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results on the full set of high-quality papers according to the Jadad scale; column (2) shows the results for those effect sizes pertaining to programs implemented by the government; column (3) shows the results for academic/NGO-implemented programs.
Table 15: Regression of Effect Size on Hierarchical Bayesian Meta-Analysis Results: Minimum Four Papers per Intervention-Outcome
(1)   (2)   (3)
Effect size   Effect size   Effect size
b/se   b/se   b/se
Meta-analysis result   0.682***   ***
(0.12)   (0.45)   (0.14)
Observations
R²
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination; the one observation left out is on the left-hand side of the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results for all the intervention-outcome combinations covered by at least four papers; column (2) shows the results for those effect sizes pertaining to programs implemented by the government; column (3) shows the results for academic/NGO-implemented programs.
Table 16: Regression of Effect Size on Hierarchical Bayesian Meta-Analysis Results: Including All Observations
(1)   (2)   (3)
Effect size   Effect size   Effect size
b/se   b/se   b/se
Meta-analysis result   0.580***   0.466***   0.708***
(0.09)   (0.08)   (0.14)
Observations
R²
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination; the one observation left out is on the left-hand side of the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results on the full data set; column (2) shows the results for those effect sizes pertaining to programs implemented by the government; column (3) shows the results for academic/NGO-implemented programs.
Table 17: Regression of Effect Size on Fixed Effect Meta-Analysis Results
(1)   (2)   (3)
Effect size   Effect size   Effect size
b/se   b/se   b/se
Meta-analysis result   0.657***   0.543***   0.814***
(0.08)   (0.05)   (0.07)
Observations
R²
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination in a fixed effect meta-analysis; the one observation left out is on the left-hand side of the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results on the full data set; column (2) shows the results for those effect sizes pertaining to programs implemented by the government; column (3) shows the results for academic/NGO-implemented programs.
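The leave-one-out construction described in the table notes can be sketched as follows. This is a minimal illustration for a single intervention-outcome combination, assuming the standard inverse-variance form of a fixed effect meta-analysis; the function names `fixed_effect_pool` and `leave_one_out_pairs` are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple


def fixed_effect_pool(estimates: List[float],
                      std_errors: List[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean of the estimates and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, (1.0 / total) ** 0.5


def leave_one_out_pairs(estimates: List[float],
                        std_errors: List[float]) -> List[Tuple[float, float]]:
    """For each study, pair its own estimate with the pooled result of all others.

    The first element of each pair would sit on the left-hand side of the
    regression; the second is the meta-analysis result used as the regressor.
    """
    if len(estimates) < 2:
        raise ValueError("need at least two studies to leave one out")
    pairs = []
    for i, y in enumerate(estimates):
        rest_y = estimates[:i] + estimates[i + 1:]
        rest_se = std_errors[:i] + std_errors[i + 1:]
        pooled, _ = fixed_effect_pool(rest_y, rest_se)
        pairs.append((y, pooled))
    return pairs
```

In the paper the pairs from all intervention-outcome combinations are stacked into one regression with clustered standard errors; that step is omitted here.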
Table 18: Regression of Effect Size on Hierarchical Bayesian Meta-Analysis Results: Published vs. Unpublished Papers
(1)   (2)
Published   Unpublished
b/se   b/se
Meta-analysis result   0.600***
(0.14)   (0.30)
Observations
R²
The meta-analysis result in the table above was created by synthesizing all but one observation within an intervention-outcome combination; the one observation left out is on the left-hand side of the regression. All interventions and outcomes are included in the regression, clustering by intervention-outcome. Column (1) shows the results using the set of papers that were published; column (2) shows the results using only unpublished or working papers.
Table 19: Caliper Tests for Political Science, Reproduced from Gerber and Malhotra (2008a) for Comparison
Columns: Over Caliper | Under Caliper | p-value
A. APSR
Vol
% Caliper   <
% Caliper   <
% Caliper   <0.001
Vol
% Caliper   <
% Caliper   <
% Caliper   <0.001
Vol
% Caliper
% Caliper
% Caliper
B. AJPS
Vol
% Caliper   <
% Caliper   <
% Caliper   <0.001
Vol
% Caliper   <
% Caliper
% Caliper
Vol
% Caliper
% Caliper   <
% Caliper   <
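As a rough illustration of what a caliper test computes, the sketch below counts test statistics falling just over and just under a critical value and applies a one-sided binomial test. It assumes the caliper is a band of plus or minus a given fraction of the critical value (for example 1.96 ± 10%) and that, absent bias, a statistic inside the band is equally likely to fall on either side; the function names and the treatment of boundary values are assumptions made for this example, not the paper's implementation.

```python
from math import comb
from typing import Iterable, Tuple


def caliper_counts(z_stats: Iterable[float],
                   critical: float = 1.96,
                   caliper: float = 0.10) -> Tuple[int, int]:
    """Count |z| values just over and just under the critical value."""
    lower = critical * (1.0 - caliper)
    upper = critical * (1.0 + caliper)
    over = under = 0
    for z in z_stats:
        a = abs(z)
        if critical < a <= upper:
            over += 1
        elif lower <= a <= critical:
            under += 1
    return over, under


def caliper_p_value(over: int, under: int) -> float:
    """One-sided binomial p-value: P(X >= over) with X ~ Binomial(over + under, 0.5)."""
    n = over + under
    if n == 0:
        raise ValueError("no observations fall inside the caliper")
    return sum(comb(n, k) for k in range(over, n + 1)) / 2 ** n
```

A small p-value indicates more results just over the threshold than would be expected by chance, which is the pattern associated with specification searching or publication bias.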
Table 20: Regression of Studies' Absolute Percent Difference from the Within-Intervention-Outcome Mean on Chronological Order
(1)   (2)   (3)   (4)
<500%   <1000%   <1500%   <2000%
b/se   b/se   b/se   b/se
Chronological order
(0.21)   (0.47)   (0.61)   (0.67)
Observations
R²
Each column restricts attention to a set of results a given maximum percentage away from the mean result in an intervention-outcome combination.
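The dependent variable in Table 20 can be illustrated with a short sketch. It assumes the within-group mean includes the study itself and that the regression is a simple bivariate OLS on the study's position in chronological order; both are assumptions for illustration, as are the function names, and the paper's exact specification may differ.

```python
from typing import List


def abs_percent_difference(effects: List[float]) -> List[float]:
    """Absolute percent difference of each effect from the group mean."""
    mean = sum(effects) / len(effects)
    if mean == 0:
        raise ValueError("percent difference is undefined when the mean is zero")
    return [abs(y - mean) / abs(mean) * 100.0 for y in effects]


def ols_slope(x: List[float], y: List[float]) -> float:
    """Slope from a bivariate OLS regression of y on x (with an intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical effect sizes for one intervention-outcome, in chronological order.
effects = [4.0, 2.0, 3.0]
order = [1.0, 2.0, 3.0]
slope = ols_slope(order, abs_percent_difference(effects))
```

A negative slope would mean that later studies lie closer to the group mean than earlier ones.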
D Simulations
I show that the distribution of standardized effect sizes is recoverable from a range of biases, including selection bias (dropping insignificant results), data-peeking (adding data until getting the results one wants), cherry-picking dependent variables (choosing dependent variables so as to get significance), and selectively excluding outliers. I use the same specifications (fixed effects vs. random effects, etc.) as Simonsohn et al. (2014). They take a vector of estimated effect sizes and return a scalar unbiased estimate of the effect size; I focus on the distribution of effect sizes. Each measures the PRESS statistic, defined as:
\[ \mathrm{PRESS} = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \tag{17} \]
where $\hat{Y}_i$ is the predicted value for the effect size $Y_i$ based on all observations except $i$. Here, $\hat{Y}_i$ represents the leave-one-out mean effect size. Code is available at
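As a small worked illustration of equation (17), the sketch below computes PRESS when the prediction for each effect size is the unweighted mean of all the other effect sizes. The function name is hypothetical and this is not the author's code.

```python
from typing import List


def press_leave_one_out_mean(effects: List[float]) -> float:
    """PRESS with each prediction equal to the mean of the other observations."""
    n = len(effects)
    if n < 2:
        raise ValueError("need at least two effect sizes")
    total = sum(effects)
    press = 0.0
    for y in effects:
        loo_mean = (total - y) / (n - 1)  # mean of all observations except this one
        press += (y - loo_mean) ** 2
    return press
```

For effect sizes 1, 2 and 3 the leave-one-out means are 2.5, 2 and 1.5, so PRESS = (-1.5)^2 + 0^2 + 1.5^2 = 4.5.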
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationPredicting the probabilities of participation in formal adult education in Hungary
Péter Róbert Predicting the probabilities of participation in formal adult education in Hungary SP2 National Report Status: Version 24.08.2010. 1. Introduction and motivation Participation rate in formal
More informationUNDERSTANDING THE TWOWAY ANOVA
UNDERSTANDING THE e have seen how the oneway ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables
More informationECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
More information5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
More informationStatistics 151 Practice Midterm 1 Mike Kowalski
Statistics 151 Practice Midterm 1 Mike Kowalski Statistics 151 Practice Midterm 1 Multiple Choice (50 minutes) Instructions: 1. This is a closed book exam. 2. You may use the STAT 151 formula sheets and
More informationMore details on the inputs, functionality, and output can be found below.
Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a twoarmed trial comparing
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationTeacher preparation program student performance data models: Six core design principles
Teacher preparation program student performance models: Six core design principles Just as the evaluation of teachers is evolving into a multifaceted assessment, so too is the evaluation of teacher preparation
More informationExploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
More information