
How Much Can We Generalize? Measuring the External Validity of Impact Evaluations

Eva Vivalt

New York University

August 31, 2015

Abstract

Impact evaluations aim to predict the future, but they are rooted in particular contexts and to what extent they generalize is an open and important question.

I founded an organization to systematically collect and synthesize impact evaluation results on a wide variety of interventions in development. These data allow me to answer this and other questions for the first time using a large data set of studies. I consider several measures of generalizability, discuss the strengths and limitations of each metric, and provide benchmarks based on the data. I use the example of the effect of conditional cash transfers on enrollment rates to show how some of the heterogeneity can be modelled and the effect this can have on the generalizability measures. The predictive power of the model improves over time as more studies are completed. Finally, I show how researchers can estimate the generalizability of their own study using their own data, even when data from no comparable studies exist.

E-mail: eva.vivalt@nyu.edu. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the Australian National University, and the University of Ottawa, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto.


1 Introduction

In the last few years, impact evaluations have become extensively used in development economics research. Policymakers and donors typically fund impact evaluations precisely to figure out how effective a similar program would be in the future to guide their decisions on what course of action they should take. However, it is not yet clear how much we can extrapolate from past results or under which conditions. Further, there is some evidence that even a similar program, in a similar environment, can yield different results. For example, Bold et al. (2013) carry out an impact evaluation of a program to provide contract teachers in Kenya; this was a scaled-up version of an earlier program studied by Duflo, Dupas and Kremer (2012). The earlier intervention studied by Duflo, Dupas and Kremer was implemented by an NGO, while Bold et al. compared implementation by an NGO and the government. While Duflo, Dupas and Kremer found positive effects, Bold et al. showed significant results only for the NGO-implemented group. The different findings in the same country for purportedly similar programs point to the substantial context-dependence of impact evaluation results. Knowing this context-dependence is crucial in order to understand what we can learn from any impact evaluation.

While the main reason to examine generalizability is to aid interpretation and improve predictions, it would also help to direct research attention to where it is most needed. If generalizability were higher in some areas, fewer papers would be needed to understand how people would behave in a similar situation; conversely, if there were topics or regions where generalizability was low, it would call for further study. With more information, researchers can better calibrate where to direct their attention to generate new insights.

It is well-known that impact evaluations only happen in certain contexts. For example, Figure 1 shows a heat map of the geocoded impact evaluations in the data used in this paper overlaid by the distribution of World Bank projects (black dots). Both sets of data are geographically clustered, and whether or not we can reasonably extrapolate from one to another depends on how much related heterogeneity there is in treatment effects. Allcott (forthcoming)


Figure 1: Growth of Impact Evaluations and Location Relative to Programs

The figure on the left shows a heat map of the impact evaluations in AidGrade's database overlaid by black dots indicating where the World Bank has done projects. While there are many other development programs not done by the World Bank, this figure illustrates the great numbers and geographical dispersion of development programs. The figure on the right plots the number of studies that came out in each year that are contained in each of three databases described in the text: 3ie's title/abstract/keyword database of impact evaluations; J-PAL's database of affiliated randomized controlled trials; and AidGrade's database of impact evaluation results data.

Impact evaluations are still exponentially increasing in number and in terms of the resources devoted to them. The World Bank recently received a major grant from the UK aid agency DFID to expand its already large impact evaluation work; the Millennium Challenge Corporation has committed to conduct rigorous impact evaluations for 50% of its activities, with "some form of credible evaluation of impact" for every activity (Millennium Challenge Corporation, 2009); and the U.S. Agency for International Development is also increasingly invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of program funds to evaluation.1

Yet while impact evaluations are still growing in development, a few thousand are already complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL, a center for development economics research, have completed each year; alongside are the number of development-related impact evaluations released that year according to 3ie, which keeps a directory of titles, abstracts, and other basic information on impact evaluations more broadly, including quasi-experimental designs; finally, the dashed line shows the number of papers that came out in each year that are included in AidGrade's database of impact evaluation results, which will be described shortly.

1While most of these are less rigorous “performance evaluations”, country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).


In short, while we do impact evaluation to figure out what will happen in the future, many issues have been raised about how well we can extrapolate from past impact evaluations. Despite the importance of the topic, we were previously able to do little more than guess or examine the question in narrow settings, as we did not have the data. Now we have the opportunity to address that speculation, drawing on a large, unique dataset of impact evaluation results.

I founded a non-profit organization dedicated to gathering this data. That organization, AidGrade, seeks to systematically understand which programs work best where, a task that requires also knowing the limits of our knowledge. To date, AidGrade has conducted 20 meta-analyses and systematic reviews of different development programs.2 Data gathered through meta-analyses are the ideal data to answer the question of how much we can extrapolate from past results, and since data on these 20 topics were collected in the same way, coding the same outcomes and other variables, we can look across different types of programs to see if there are any more general trends. Currently, the data set contains 647 papers on 210 narrowly-defined intervention-outcome combinations, with the greater database containing 15,021 estimates.

I define generalizability and discuss several metrics with which to measure it. Other disciplines have considered generalizability more extensively, so I draw on the meta-analysis literature, which is best developed in medicine, as well as the psychometric literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb, 2006; Briggs and Wilson, 2007). The measures I discuss could also be used in conjunction with any model that seeks to explain variation in treatment effects (e.g. Dehejia, Pop-Eleches and Samii, 2015) to quantify the proportion of variation that such a model explains. Since some of the analyses will draw upon statistical methods not commonly used in economics, I will use the concrete example of conditional cash transfers (CCTs), which are relatively well-understood and on which many papers have been written, to elucidate the issues.

While this paper focuses on results for impact evaluations of development programs, this


2 Theory

2.1 Heterogeneous Treatment Effects

I model treatment effects as potentially depending on the context of the intervention.

Each impact evaluation is on a particular intervention and covers a number of outcomes.

The relationship between an outcome, the inputs that were part of the intervention, and the context of the study is complex. In the simplest model, we can imagine that context can be represented by a "contextual variable", C, such that:

$Z_j = \alpha + \beta T_j + \delta C_j + \gamma T_j C_j + \varepsilon_j$   (1)

where j indexes the individual, Z represents the value of an aggregate outcome such as "enrollment rates", T indicates being treated, and C represents a contextual variable, such as the type of agency that implemented the program.3

In this framework, a particular impact evaluation might explicitly estimate:

$Z_j = \alpha + \beta_1 T_j + \varepsilon_j$   (2)

but, as Equation 1 can be re-written as $Z_j = \alpha + (\beta + \gamma C_j) T_j + \delta C_j + \varepsilon_j$, what β_1 is really capturing is the effect β_1 = β + γC. When C varies, unobserved, in different contexts, the variance of β_1 increases.

This is the simplest case. One can imagine that the true state of the world has "interaction effects all the way down".

Interaction terms are often considered a second-order problem. However, that intuition could stem from the fact that we usually look for interaction terms within an already fairly homogeneous dataset, e.g. data from a single country, at a single point in time, on a particularly selected sample.
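To see how an unobserved contextual interaction translates into dispersion of estimated treatment effects, the following simulation is a minimal sketch of Equations 1 and 2; all parameter values and variable names are illustrative assumptions rather than anything estimated in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study_effects(n_studies=200, n_obs=500, beta=0.2, gamma=0.3):
    """Simulate per-study estimates of beta_1 = beta + gamma * C when C is unobserved.

    Each study has its own contextual variable C; within a study, half the sample
    is treated and the estimated effect is a simple difference in means.
    """
    estimates = []
    for _ in range(n_studies):
        c = rng.normal()                      # study-level context, unobserved by the analyst
        t = rng.integers(0, 2, n_obs)         # treatment indicator
        eps = rng.normal(0, 1, n_obs)
        z = 0.1 + (beta + gamma * c) * t + 0.05 * c + eps
        estimates.append(z[t == 1].mean() - z[t == 0].mean())
    return np.array(estimates)

no_interaction = simulate_study_effects(gamma=0.0)
with_interaction = simulate_study_effects(gamma=0.3)
print("var of estimated effects, gamma = 0  :", no_interaction.var().round(4))
print("var of estimated effects, gamma = 0.3:", with_interaction.var().round(4))
```

With γ = 0 the spread of the estimates reflects sampling error alone; with γ ≠ 0 the between-study variance rises by roughly γ² times the variance of the contextual variable.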

Not all aspects of context need matter to an intervention's outcomes. The set of contextual variables can be divided into a critical set on which outcomes depend and a set on which they do not; I will ignore the latter. Further, the relationship between Z and C can vary by intervention or outcome. For example, school meals programs might have more of an effect on younger children, but scholarship programs could plausibly affect older children more. If we were to regress effect size on the contextual variable "age", we would get different results depending on which intervention and outcome we were considering. Therefore,

3Z can equally well be thought of as the average individual outcome for an intervention. Throughout, I take high values for an outcome to represent a beneficial change unless otherwise noted; if an outcome represents a negative characteristic, like incidence of a disease, its sign will be flipped before analysis.


it will be important in this paper to look only at a restricted set of contextual variables which could plausibly work in a similar way across different interventions. Additional analysis could profitably be done within some interventions, but this is outside the scope of this paper.

Generalizability will ultimately depend on the heterogeneity of treatment effects. The next section formally defines generalizability for use in this paper.

2.2 Generalizability: Definitions and Measurement

Definition 1 Generalizability is the ability to predict results accurately out of sample.

Definition 2 Local generalizability is the ability to predict results accurately in a particular out-of-sample group.

There are several ways to operationalize these definitions. The ability to predict results hinges both on the variability of the results and the proportion that can be explained. For example, if the overall variability in a set of results is high, this might not be as concerning if the proportion of variability that can be explained is also high.

It is straightforward to measure the variance in results. However, these statistics need to be benchmarked in order to know what is a "high" or "low" variance. One advantage of the large data set used in this paper is that I can use it to benchmark the results from different intervention-outcome combinations against each other. This is not the first paper to tentatively suggest a scale. Other rules of thumb have also been created in this manner, such as those used to consider the magnitude of effect sizes (0-0.2 SD = "small", 0.2-0.5 SD = "medium", > 0.5 SD = "large") (Cohen, 1988) or the measure of the impact of heterogeneity on meta-analysis results, I² (0.25 = "low", 0.5 = "medium", 0.75 = "high") (Higgins et al., 2003). I can also compare across-paper variation to within-paper variation, with the idea that within-study variation should represent a lower bound to across-study variation.


be in terms of a common unit, such as standard deviations4; 3) scale the standard deviation by the mean result, creating the coefficient of variation. The coefficient of variation represents the inverse of the signal-to-noise ratio, and as a unitless figure can be compared across intervention-outcome combinations with different natural units. It is not immune to criticism, however, particularly in that it may result in large values as the mean approaches zero.5
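As a small numerical illustration of that caveat, the sketch below (hypothetical effect sizes, not the paper's data) computes the absolute coefficient of variation and shows how it balloons when the mean result is close to zero.

```python
import numpy as np

def coefficient_of_variation(effect_sizes):
    """Absolute coefficient of variation: sd / |mean|, following the paper's convention."""
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    return effect_sizes.std(ddof=1) / abs(effect_sizes.mean())

print(coefficient_of_variation([0.10, 0.15, 0.20, 0.25]))   # moderate mean -> modest CV
print(coefficient_of_variation([-0.05, 0.00, 0.05, 0.02]))  # mean near zero -> very large CV
```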

All the measures discussed so far focus on variation. However, if we could explain the variation, it would no longer worsen our ability to make predictions in a new setting, so long as we had all the necessary data from that setting, such as covariates, with which to extrapolate.

To explain variation, we need a model. The meta-analysis literature suggests two general types of models which can be parameterized in many ways: fixed-effect models and random-effects models.

Fixed-effect models assume there is one true effect of a particular program and all differences between studies can be attributed simply to sampling error. In other words:

$Y_i = \theta + \varepsilon_i$   (3)

where Y_i is the observed effect size of a particular study, θ is the true effect and ε_i is the error term.

Random-effects models do not make this assumption; the true effect could potentially vary from context to context. Here,

$Y_i = \theta_i + \varepsilon_i$   (4)

$\;\;\;\;= \bar{\theta} + \eta_i + \varepsilon_i$   (5)

where θ_i is the effect size for a particular study i, θ̄ is the mean true effect size, η_i is a particular study's divergence from that mean true effect size, and ε_i is the error. Random-effects models are more plausible and they are necessary if we think there are heterogeneous treatment effects, so I use them in this paper. Random-effects models can also be modified by the addition of explanatory variables, at which point they are called mixed models; I will also use mixed models in this paper.

The sampling variance of Y_i is denoted σ² and the between-study variance, var(θ_i), is denoted τ².

4This can be problematic if the standard deviations themselves vary but is a common approach in the meta-analysis literature in lieu of a better option.

5This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.


The variation in observed effect sizes is then:

$\mathrm{var}(Y_i) = \tau^2 + \sigma^2$   (6)

and the proportion of the variation that is not sampling error is:

$I^2 = \dfrac{\tau^2}{\tau^2 + \sigma^2}$   (7)

The I² is an established metric in the meta-analysis literature that helps determine whether a fixed or random effects model is more appropriate; the higher the I², the less plausible it is that sampling error drives all the variation in results. I² is considered "low" at 0.25, "medium" at 0.5, and "high" at 0.75 (Higgins et al., 2003).6
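As a sketch of how Equations 6 and 7 can be taken to a set of study results, the code below computes a method-of-moments (DerSimonian-Laird) estimate of τ² and the corresponding I², treating σ² as the "typical" sampling variance of Higgins and Thompson (2002); the inputs are hypothetical, and the paper's own estimation uses the hierarchical Bayesian approach of Section 2.3 instead.

```python
import numpy as np

def dl_tau2_and_i2(y, se):
    """Method-of-moments (DerSimonian-Laird) tau^2 and the I^2 of Equation 7.

    y  : study effect sizes (e.g. standardized mean differences)
    se : their standard errors
    """
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = 1.0 / se**2
    k = len(y)
    y_bar = np.sum(w * y) / np.sum(w)           # fixed-effect (inverse-variance) mean
    q = np.sum(w * (y - y_bar) ** 2)            # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)          # between-study variance
    sigma2 = (k - 1) / c                        # "typical" within-study (sampling) variance
    return tau2, tau2 / (tau2 + sigma2)

# hypothetical studies: effect sizes and standard errors
tau2, i2 = dl_tau2_and_i2([0.05, 0.20, 0.35, 0.10], [0.04, 0.05, 0.06, 0.05])
print(round(tau2, 4), round(i2, 3))
```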

If we wanted to explain more of the variation, we could do moderator or mediator analysis, in which we examine how results vary with the characteristics of the study, characteristics of its sample, or details about the intervention and its implementation. A linear meta-regression is one way of accomplishing this goal, explicitly estimating:

$Y_i = \beta_0 + \sum_n \beta_n X_n + \eta_i + \varepsilon_i$

where X_n are explanatory variables. This is a mixed model and, upon estimating it, we can calculate several additional statistics: the amount of residual variation in Y_i after accounting for X_n, var_R(Y_i − Ŷ_i); the coefficient of residual variation, CV_R(Y_i − Ŷ_i); and the residual I_R². Further, we can examine the R² of the meta-regression.
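A minimal sketch of such a meta-regression, using inverse-variance weighted least squares as a stand-in for the full mixed model estimated later in the paper; the effect sizes, covariate, and standard errors below are hypothetical.

```python
import numpy as np

def weighted_meta_regression(y, X, se):
    """Inverse-variance weighted least squares for Y_i = b0 + sum_n b_n X_n + error.

    Returns the coefficients, fitted values, the residual variation var_R(Y_i - Yhat_i),
    and an R^2, to illustrate how covariates soak up part of the raw variation.
    """
    y, X, se = np.asarray(y, float), np.asarray(X, float), np.asarray(se, float)
    W = np.diag(1.0 / se**2)
    Xd = np.column_stack([np.ones(len(y)), X])            # add an intercept
    beta = np.linalg.solve(Xd.T @ W @ Xd, Xd.T @ W @ y)   # WLS coefficients
    y_hat = Xd @ beta
    resid_var = np.var(y - y_hat, ddof=Xd.shape[1])
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return beta, y_hat, resid_var, r2

# hypothetical effect sizes, one covariate (e.g. baseline enrollment), and standard errors
y = [0.30, 0.22, 0.10, 0.05, 0.15]
X = [[0.4], [0.5], [0.8], [0.9], [0.7]]
se = [0.05, 0.06, 0.04, 0.05, 0.06]
beta, y_hat, resid_var, r2 = weighted_meta_regression(y, X, se)
print("coefficients:", beta.round(3), "residual variance:", round(float(resid_var), 4), "R^2:", round(float(r2), 3))
```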

It should be noted that a linear meta-regression is only one way of modelling variation in Y_i. The I², for example, is analogous to the reliability coefficient of classical test theory or the generalizability coefficient of generalizability theory (a branch of psychometrics), both of which estimate the proportion of variation that is not error. In this literature, additional heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling


Table 1: Summary of heterogeneity measures

| Measure | Measure of variation | Measure of proportion of variation that is systematic | Measure makes use of explanatory variables |
|---|---|---|---|
| var(Y_i) | X | | |
| var_R(Y_i − Ŷ_i) | X | | X |
| CV(Y_i) | X | | |
| CV_R(Y_i − Ŷ_i) | X | | X |
| I² | | X | |
| I_R² | | X | X |
| R² | | X | X |

Table 2: Desirable properties of a measure of heterogeneity

| Measure | Does not depend on the number of studies in a cell | Does not depend on the precision of individual estimates | Does not depend on the estimates' units | Does not depend on the mean result in the cell |
|---|---|---|---|---|
| var(Y_i) | X | X | | X |
| var_R(Y_i − Ŷ_i) | X | X | | X |
| CV(Y_i) | X | X | X | |
| CV_R(Y_i − Ŷ_i) | X | X | X | |
| I² | X | | X | X |
| I_R² | X | | X | X |
| R² | X | X | X | X |

A “cell” here refers to an intervention-outcome combination. The “precision” of an estimate refers to its standard error.

Table 2 summarizes desirable properties of a measure of heterogeneity and which properties are possessed by each of the discussed indicators. Measuring heterogeneity using the variance of Y_i requires the Y_i to have comparable units. Using the coefficient of variation requires the assumption that the mean effect size is an appropriate measure with which to scale sd(Y_i). The variance and coefficient of variation also do not have anything to say about the amount of heterogeneity that can be explained. Adding explanatory variables also has its limitations. In any model, we have no way to guarantee that we are indeed capturing all the relevant factors. While I² has the nice property that it disaggregates sampling variance as a source of variation, estimating it depends on the weights applied to each study's results and thus, in turn, on the sample sizes of the studies. The R² has its own well-known caveats, such as that it can be artificially inflated by over-fitting.


Having discussed the different measures of generalizability I will use in this paper, I turn to describe how I will estimate the parameters of the random effects or mixed models.

2.3 Hierarchical Bayesian Analysis

This paper uses meta-analysis as a tool to synthesize evidence.

As a quick review, there are many steps in a meta-analysis, most of which have to do with the selection of the constituent papers. The search and screening of papers will be described in the data section; here, I merely discuss the theory behind how meta-analyses combine results and estimate the parameters σ² and τ² that will be used to generate I².

I begin by presenting the random effects model, followed by the related strategy to estimate a mixed model.

2.4 Estimating a Random Effects Model

To build a hierarchical Bayesian random effects model, I first assume the data are normally distributed:

$Y_{ij} \sim N(\theta_i, \sigma^2)$   (8)

where j indexes the individuals in the study. I do not have individual-level data, but instead can use sufficient statistics:

$Y_i \sim N(\theta_i, \sigma_i^2)$   (9)

where Y_i is the sample mean and σ_i² the sample variance. This provides the likelihood for θ_i. I also need a prior for θ_i. I assume between-study normality:

$\theta_i \sim N(\mu, \tau^2)$   (10)

where µ and τ are unknown hyperparameters.

Conditioning on the distribution of the data, given by Equation 9, I get a posterior:


and as the Y_i are estimates of µ with variance (σ_i² + τ²), I obtain:

$\mu \mid \tau, Y \sim N(\hat{\mu}, V_\mu)$   (13)

where

$\hat{\mu} = \dfrac{\sum_i Y_i/(\sigma_i^2 + \tau^2)}{\sum_i 1/(\sigma_i^2 + \tau^2)}, \qquad V_\mu = \dfrac{1}{\sum_i 1/(\sigma_i^2 + \tau^2)}$   (14)

For τ, note that $p(\tau \mid Y) = \dfrac{p(\mu, \tau \mid Y)}{p(\mu \mid \tau, Y)}$. The denominator follows from Equation 13; for the numerator, we can observe that p(µ, τ | Y) is proportional to p(µ, τ) p(Y | µ, τ), and we know the marginal distribution of Y_i | µ, τ:

$Y_i \mid \mu, \tau \sim N(\mu, \sigma_i^2 + \tau^2)$   (15)

I use a uniform prior for τ, following Gelman et al. (2005). This yields the posterior for the numerator:

$p(\mu, \tau \mid Y) \propto p(\mu, \tau) \prod_i N(Y_i \mid \mu, \sigma_i^2 + \tau^2)$   (16)

Putting together all the pieces in reverse order, I first generate p(τ | Y) and simulate τ from it, then draw µ given τ, and finally the θ_i.
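The procedure can be sketched as follows, following the standard grid-simulation approach for the normal hierarchical model in Gelman et al.; the inputs, the grid for τ, and the number of draws are illustrative choices rather than the paper's actual data or code.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_random_effects(y, sigma2, n_draws=2000, n_grid=400):
    """Draw (tau, mu, theta) from the normal hierarchical model with a uniform prior on tau.

    Grid simulation: evaluate p(tau | Y) on a grid, draw tau, then mu | tau, Y
    (Equations 13-14), then each theta_i | mu, tau, Y.
    y, sigma2 : study effect sizes and their sampling variances (hypothetical inputs).
    """
    y, sigma2 = np.asarray(y, float), np.asarray(sigma2, float)
    tau_grid = np.linspace(1e-3, 2.0, n_grid)
    log_p = np.empty(n_grid)
    for g, tau in enumerate(tau_grid):
        v = sigma2 + tau**2
        mu_hat = np.sum(y / v) / np.sum(1 / v)
        v_mu = 1.0 / np.sum(1 / v)
        # log p(tau | Y) up to a constant, for a uniform prior on tau
        log_p[g] = 0.5 * np.log(v_mu) - 0.5 * np.sum(np.log(v)) - 0.5 * np.sum((y - mu_hat) ** 2 / v)
    p_tau = np.exp(log_p - log_p.max())
    p_tau /= p_tau.sum()

    draws = []
    for _ in range(n_draws):
        tau = rng.choice(tau_grid, p=p_tau)
        v = sigma2 + tau**2
        mu_hat = np.sum(y / v) / np.sum(1 / v)
        mu = rng.normal(mu_hat, np.sqrt(1.0 / np.sum(1 / v)))
        prec = 1 / sigma2 + 1 / tau**2
        theta = rng.normal((y / sigma2 + mu / tau**2) / prec, np.sqrt(1 / prec))
        draws.append((tau, mu, theta))
    return draws

draws = sample_random_effects(y=[0.05, 0.20, 0.35, 0.10], sigma2=[0.002, 0.003, 0.004, 0.003])
print("posterior mean of mu:", round(float(np.mean([d[1] for d in draws])), 3))
```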

2.5 Estimating a Mixed Model

The strategy here is similar. Appendix D contains a derivation.

3 Data

This paper uses a database of impact evaluation results collected by AidGrade, a U.S. non-profit research institute that I founded in 2012. AidGrade focuses on gathering the results of impact evaluations and analyzing the data, including through meta-analysis. Its data on impact evaluation results were collected in the course of its meta-analyses from 2012-2014 (AidGrade, 2015).

AidGrade’s meta-analyses follow the standard stages: (1) topic selection; (2) a search for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more detail is provided in Appendix B.


3.1 Selection of Papers

Interventions were selected for meta-analysis largely on the basis of there being a sufficient number of studies on that topic. Five AidGrade staff members each independently made a preliminary list of interventions for examination; the lists were then combined and searches done for each topic to determine if there were likely to be enough impact evaluations for a meta-analysis. The remaining list was voted on by the general public online and partially randomized. Appendix B provides further detail.

A comprehensive literature search was done using a mix of the search aggregators SciVerse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA and 3ie were also searched for completeness. Finally, the references of any existing systematic reviews or meta-analyses were collected.

Any impact evaluation which appeared to be on the intervention in question was included, barring those in developed countries.7 Any paper that tried to consider the counterfactual was considered an impact evaluation. Both published papers and working papers were included. The search and screening criteria were deliberately broad. There is not enough room to include the full text of the search terms and inclusion criteria for all 20 topics in this paper, but these are available in an online appendix as detailed in Appendix A.

3.2 Data Extraction

The subset of the data on which I am focusing is based on those papers that passed all screening stages in the meta-analyses. Again, the search and screening criteria were very broad and, after passing the full text screening, the vast majority of papers that were later excluded were excluded merely because they had no outcome variables in common or did not provide adequate data (for example, not providing data that could be used to calculate the standard error of an estimate, or for a variety of other quirky reasons, such as displaying results only graphically). The small overlap of outcome variables is a surprising and notable


evaluations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318 of these papers both overlapped in outcomes with another paper and were able to be standardized and thus included in the main results which rely on intervention-outcome groups.

Outcomes were defined under several rules of varying specificity, as will be discussed shortly.

Table 3: List of Development Programs Covered

| 2012 | 2013 |
|---|---|
| Conditional cash transfers | Contract teachers |
| Deworming | Financial literacy training |
| Improved stoves | HIV education |
| Insecticide-treated bed nets | Irrigation |
| Microfinance | Micro health insurance |
| Safe water storage | Micronutrient supplementation |
| Scholarships | Mobile phone-based reminders |
| School meals | Performance pay |
| Unconditional cash transfers | Rural electrification |
| Water treatment | Women's empowerment programs |

Seventy-three variables were coded for each paper. Additional topic-specific variables were coded for some sets of papers, such as the median and mean loan size for microfinance programs. This paper focuses on the variables held in common across the different topics. These include which method was used; if randomized, whether it was randomized by cluster; whether it was blinded; where it was (village, province, country - these were later geocoded in a separate process); what kind of institution carried out the implementation; characteristics of the population; and the duration of the intervention from the baseline to the midline or endline results, among others. A full set of variables and the coding manual is available online, as detailed in Appendix A.

As this paper pays particular attention to the program implementer, it is worth discussing how this variable was coded in more detail. There were several types of implementers that could be coded: governments, NGOs, private sector firms, and academics. There was also a code for "other" (primarily collaborations) or "unclear". The vast majority of studies were implemented by academic research teams and NGOs. This paper considers NGOs and academic research teams together because it turned out to be practically difficult to distinguish between them in the studies, especially as the passive voice was frequently used (e.g. "X was done" without noting who did it). There were only a few private sector firms involved, so they are considered with the "other" category in this paper.

Studies tend to report results for multiple specifications. AidGrade focused on those results least likely to have been influenced by author choices: those with the fewest controls, apart from fixed effects. Where a study reported results using different methodologies, coders were instructed to collect the findings obtained under the authors' preferred methodology; where the preferred methodology was unclear, coders were advised to follow the internal preference ordering of prioritizing randomized controlled trials, followed by regression discontinuity designs and differences-in-differences, followed by matching, and to collect multiple sets of results when they were unclear on which to include. Where results were presented separately for multiple subgroups, coders were similarly advised to err on the side of caution and to collect both the aggregate results and results by subgroup except where the author appeared to be only including a subgroup because results were significant within that subgroup. For example, if an author reported results for children aged 8-15 and then also presented results for children aged 12-13, only the aggregate results would be recorded, but if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups would be coded as well as the aggregate result when presented. Authors only rarely reported isolated subgroups, so this was not a major issue in practice.

When considering the variation of effect sizes within a group of papers, the definition of the group is clearly critical. Two different rules were initially used to define outcomes: a strict rule, under which only identical outcome variables are considered alike, and a loose rule, under which similar but distinct outcomes are grouped into clusters.

The precise coding rules were as follows:

1. We consider outcome A to be the same as outcome B under the "strict rule" if outcomes A and B measure the exact same quality. Different units may be used, pending conversion. The outcomes may cover different timespans (e.g. encompassing both outcomes over "the last month" and "the last week"). They may also cover different populations (e.g. children or adults). Examples: height; attendance rates.

2. We consider outcome A to be the same as outcome B under the “loose rule” if they


those papers on the same program, such as the various evaluations of PROGRESA.

After coding, the data were then standardized to make results easier to interpret and so as not to overly weight those outcomes with larger scales. The typical way to compare results across different outcomes is by using the standardized mean difference, defined as:

$SMD = \dfrac{\mu_1 - \mu_2}{\sigma_p}$

where µ_1 is the mean outcome in the treatment group, µ_2 is the mean outcome in the control group, and σ_p is the pooled standard deviation. When data are not available to calculate the pooled standard deviation, it can be approximated by the standard deviation of the dependent variable for the entire distribution of observations or as the standard deviation in the control group (Glass, 1976). If that is not available either, due to standard deviations not having been reported in the original papers, one can use the typical standard deviation for the intervention-outcome. I follow this approach to calculate the standardized mean difference, which is then used as the effect size measure for the rest of the paper unless otherwise noted.
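A small sketch of this standardization step, with the fallback choices described above (pooled SD, then a full-sample or control-group SD, then a typical SD for the intervention-outcome); the function names and numbers are illustrative, not AidGrade's actual code.

```python
import math

def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled standard deviation from group-level summary statistics."""
    return math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))

def standardized_mean_difference(mu_t, mu_c, sd_pooled=None, sd_fallback=None, sd_typical=None):
    """SMD = (mu_t - mu_c) / sigma, choosing sigma in the order described in the text."""
    sigma = sd_pooled if sd_pooled is not None else (sd_fallback if sd_fallback is not None else sd_typical)
    if sigma is None or sigma <= 0:
        raise ValueError("no usable standard deviation available")
    return (mu_t - mu_c) / sigma

# hypothetical enrollment-rate example: treatment mean 0.85, control mean 0.80
sp = pooled_sd(sd_t=0.30, n_t=1200, sd_c=0.32, n_c=1100)
print(round(standardized_mean_difference(0.85, 0.80, sd_pooled=sp), 3))
```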

This paper uses the “strict” outcomes where available, but the “loose” outcomes where that would keep more data. For papers which were follow-ups of the same study, the most recent results were used for each outcome.

Finally, one paper appeared to misreport results, suggesting implausibly low values and standard deviations for hemoglobin. These results were excluded and the paper's corresponding author contacted. Excluding this paper's results, effect sizes range between -1.5 and 1.8 SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual results, especially with the small number of papers in some intervention-outcome groups, I restrict attention to those standardized effect sizes less than 2 SD away from 0, dropping 1 additional observation. I report main results including this observation in the Appendix.

3.3 Data Description

Figure 2 summarizes the distribution of studies covering the interventions and outcomes considered in this paper that can be standardized. Attention will typically be limited to those intervention-outcome combinations on which we have data for at least three papers.

Table 13 in Appendix C lists the interventions and outcomes and describes their results in a bit more detail, providing the distribution of significant and insignificant results. It should be emphasized that the number of negative and significant, insignificant, and positive and significant results per intervention-outcome combination only provides ambiguous evidence of the typical efficacy of a particular type of intervention. Simply tallying the numbers in each category is known as "vote counting" and can yield misleading results if, for example, some studies are underpowered.

Table 4 further summarizes the distribution of papers across interventions and highlights the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent with the story of researchers each wanting to publish one of the first papers on a topic. Vivalt (2015a) finds that later papers on the same intervention-outcome combination more often remain as working papers.

A note must be made about combining data. When conducting a meta-analysis, the Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the data to one observation per intervention-outcome-paper, and I do this for generating the within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had been reported for multiple subgroups (e.g. women and men), I aggregated them as in the Cochrane Handbook’s Table 7.7.a. Where results were reported for multiple time periods (e.g. 6 months after the intervention and 12 months after the intervention), I used the most comparable time periods across papers. When combining across multiple outcomes, which has limited use but will come up later in the paper, I used the formulae from Borenstein et al. (2009), Chapter 24.
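To illustrate the kind of aggregation involved when collapsing, say, results reported separately for girls and boys to one observation, here is a sketch using a standard formula for merging two groups' sample sizes, means, and standard deviations; the numbers are hypothetical.

```python
import math

def combine_two_groups(n1, m1, sd1, n2, m2, sd2):
    """Combine two subgroups' N, mean, and SD into an overall N, mean, and SD."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    # within-group sums of squares plus the between-group component
    ss = (n1 - 1) * sd1**2 + (n2 - 1) * sd2**2 + n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2
    return n, m, math.sqrt(ss / (n - 1))

# hypothetical subgroup results for an enrollment outcome
print(combine_two_groups(n1=400, m1=0.82, sd1=0.30, n2=450, m2=0.78, sd2=0.33))
```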


Figure 2: Within-Intervention-Outcome Number of Papers


Table 4: Descriptive Statistics: Distribution of Narrow Outcomes

| Intervention | Number of outcomes | Mean papers per outcome | Max papers per outcome |
|---|---|---|---|
| Conditional cash transfers | 10 | 21 | 37 |
| Contract teachers | 1 | 3 | 3 |
| Deworming | 12 | 13 | 18 |
| Financial literacy | 1 | 5 | 5 |
| HIV/AIDS Education | 3 | 8 | 10 |
| Improved stoves | 4 | 2 | 2 |
| Insecticide-treated bed nets | 1 | 9 | 9 |
| Irrigation | 2 | 2 | 2 |
| Micro health insurance | 1 | 2 | 2 |
| Microfinance | 5 | 4 | 5 |
| Micronutrient supplementation | 23 | 27 | 47 |
| Mobile phone-based reminders | 2 | 4 | 5 |
| Performance pay | 1 | 3 | 3 |
| Rural electrification | 3 | 3 | 3 |
| Safe water storage | 1 | 2 | 2 |
| Scholarships | 3 | 4 | 5 |
| School meals | 3 | 3 | 3 |
| Unconditional cash transfers | 3 | 9 | 11 |
| Water treatment | 2 | 5 | 6 |
| Women's empowerment programs | 2 | 2 | 2 |
| Average | 4.2 | 6.5 | 9.0 |


4 Generalizability of Impact Evaluation Results

4.1 Without Modelling Heterogeneity

Table 5 presents results for the metrics described earlier, within intervention-outcome combinations. All Y_i were converted to be in terms of standard deviations to put them on a common scale before statistics were calculated, with the aforementioned caveats.

The different measures yield quite different results, as they measure different things, as previously discussed. The coefficient of variation depends heavily on the mean; the I², on the precision of the underlying estimates.

Table 5: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes

| Intervention | Outcome | var(Y_i) | CV(Y_i) | I² |
|---|---|---|---|---|
| Microfinance | Assets | 0.000 | 5.508 | 1.000 |
| Rural Electrification | Enrollment rate | 0.001 | 0.129 | 0.768 |
| Micronutrients | Cough prevalence | 0.001 | 1.648 | 0.995 |
| Microfinance | Total income | 0.001 | 0.989 | 0.999 |
| Microfinance | Savings | 0.002 | 1.773 | 1.000 |
| Financial Literacy | Savings | 0.004 | 5.472 | 0.891 |
| Microfinance | Profits | 0.005 | 5.448 | 1.000 |
| Contract Teachers | Test scores | 0.005 | 0.403 | 1.000 |
| Performance Pay | Test scores | 0.006 | 0.608 | 1.000 |
| Micronutrients | Body mass index | 0.007 | 0.675 | 1.000 |
| Conditional Cash Transfers | Unpaid labor | 0.009 | 0.920 | 0.797 |
| Micronutrients | Weight-for-age | 0.009 | 1.941 | 0.884 |
| Micronutrients | Weight-for-height | 0.010 | 2.148 | 0.677 |
| Micronutrients | Birthweight | 0.010 | 0.981 | 0.827 |
| Micronutrients | Height-for-age | 0.012 | 2.467 | 0.942 |
| Conditional Cash Transfers | Test scores | 0.013 | 1.866 | 0.995 |
| Deworming | Hemoglobin | 0.015 | 3.377 | 0.919 |
| Micronutrients | Mid-upper arm circumference | 0.015 | 2.078 | 0.502 |
| Conditional Cash Transfers | Enrollment rate | 0.015 | 0.831 | 1.000 |
| Unconditional Cash Transfers | Enrollment rate | 0.016 | 1.093 | 0.998 |
| Water Treatment | Diarrhea prevalence | 0.020 | 0.966 | 1.000 |
| SMS Reminders | Treatment adherence | 0.022 | 1.672 | 0.780 |
| Conditional Cash Transfers | Labor force participation | 0.023 | 1.628 | 0.424 |
| School Meals | Test scores | 0.023 | 1.288 | 0.559 |
| Micronutrients | Height | 0.023 | 4.369 | 0.826 |
| Micronutrients | Mortality rate | 0.025 | 2.880 | 0.201 |
| Micronutrients | Stunted | 0.025 | 1.110 | 0.262 |
| Bed Nets | Malaria | 0.029 | 0.497 | 0.880 |
| Conditional Cash Transfers | Attendance rate | 0.030 | 0.523 | 0.939 |
| Micronutrients | Weight | 0.034 | 2.696 | 0.549 |
| HIV/AIDS Education | Used contraceptives | 0.036 | 3.117 | 0.490 |
| Micronutrients | Perinatal deaths | 0.038 | 2.096 | 0.176 |
| Deworming | Height | 0.049 | 2.361 | 1.000 |
| Micronutrients | Test scores | 0.052 | 1.694 | 0.966 |
| Scholarships | Enrollment rate | 0.053 | 0.687 | 1.000 |
| Conditional Cash Transfers | Height-for-age | 0.055 | 22.166 | 0.165 |
| Deworming | Weight-for-height | 0.072 | 3.129 | 0.986 |
| Micronutrients | Stillbirths | 0.075 | 3.041 | 0.108 |
| School Meals | Enrollment rate | 0.081 | 1.142 | 0.080 |
| Micronutrients | Prevalence of anemia | 0.095 | 0.793 | 0.692 |
| Deworming | Height-for-age | 0.098 | 1.978 | 1.000 |
| Deworming | Weight-for-age | 0.107 | 2.287 | 0.998 |
| Micronutrients | Diarrhea incidence | 0.109 | 3.300 | 0.985 |
| Micronutrients | Diarrhea prevalence | 0.111 | 1.205 | 0.837 |
| Micronutrients | Fever prevalence | 0.146 | 3.076 | 0.667 |
| Deworming | Weight | 0.184 | 4.758 | 1.000 |
| Micronutrients | Hemoglobin | 0.215 | 1.439 | 0.984 |
| SMS Reminders | Appointment attendance rate | 0.224 | 2.908 | 0.869 |
| Deworming | Mid-upper arm circumference | 0.439 | 1.773 | 0.994 |
| Conditional Cash Transfers | Probability unpaid work | 0.609 | 6.415 | 0.834 |
| Rural Electrification | Study time | 0.997 | 1.102 | 0.142 |

How should we interpret these numbers? Higgins and Thompson, who defined I², called 0.25 indicative of "low", 0.5 "medium", and 0.75 "high" levels of heterogeneity (2002; Higgins et al., 2003). Figure 3 plots a histogram of the results, with lines corresponding to these values demarcated. Clearly, there is a lot of systematic variation in the results according to the I² measure. No similar defined benchmarks exist for the variance or coefficient of variation, although studies in the medical literature tend to exhibit a coefficient of variation of approximately 0.05-0.5 (Tian, 2005; Ng, 2014). By this standard, too, results would appear quite heterogeneous.


Figure 3: Density of I² values

We can also compare values across the different intervention-outcome combinations within the data set. Here, the intervention-outcome combinations that fall within the bottom third by variance have var(Y_i) ≤ 0.015; the top third have var(Y_i) ≥ 0.052. Similarly, the threshold delineating the bottom third for the coefficient of variation is 1.14 and, for the top third, 2.36; for I², the thresholds are 0.78 and 0.99, respectively. If we expect these intervention-outcomes to be broadly comparable to others we might want to consider in the future, we could use these values to benchmark future results.
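A sketch of how such relative benchmarks could be applied in practice: given a heterogeneity statistic for many intervention-outcome cells, compute the tercile cut-offs and classify a new cell against them. The cell names and values below are hypothetical; the cut-offs quoted in the text come from the paper's data.

```python
import numpy as np

def tercile_benchmarks(cell_stats):
    """Return the lower/upper tercile cut-offs of a heterogeneity statistic across cells."""
    values = np.array(list(cell_stats.values()), dtype=float)
    return np.quantile(values, [1/3, 2/3])

def classify(value, cutoffs):
    lo, hi = cutoffs
    return "low" if value <= lo else ("high" if value >= hi else "medium")

# hypothetical var(Y_i) by intervention-outcome cell
var_by_cell = {"CCT-enrollment": 0.015, "deworming-weight": 0.184, "bed nets-malaria": 0.029,
               "microfinance-profits": 0.005, "school meals-test scores": 0.023}
cuts = tercile_benchmarks(var_by_cell)
print(cuts, classify(0.05, cuts))
```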

Defining dispersion to be "low" or "high" in this manner may be unsatisfying because the classifications that result are relative. Relative classifications might have some value, but sometimes are not so important; for example, it is hard to think that there is a meaningful difference between an I² of just below 0.99 and an I² of just above 0.99. An alternative benchmark that might have more appeal is that of the average within-study variance or coefficient of variation. If the across-study variation approached the within-study variation, we might not be so concerned about generalizability.

Table 6 illustrates the gap between the across-study and mean within-study variance, coefficient of variation, and I², for those intervention-outcomes for which we have enough data to calculate the within-study measures. Not all studies report multiple results for the intervention-outcome combination in question. A paper might report multiple results for a particular intervention-outcome combination if, for example, it were reporting results for different subgroups, such as for different age groups, genders, or geographic areas. The median within-paper variance for those papers for which this can be generated is 0.027, while it is 0.037 across papers; similarly, the median within-paper coefficient of variation is 0.91, compared to 1.48 across papers. If we were to form the I² for each paper separately, the median within-paper value would be 0.63, as opposed to 0.94 across papers. Figure 4 presents the distributions graphically; to increase the sample size, this figure includes results even when there are only two papers within an intervention-outcome combination or two results reported within a paper.


Table 6: Across-Paper vs. Mean Within-Paper Heterogeneity

| Intervention | Outcome | Across-paper var(Y_i) | Within-paper var(Y_i) | Across-paper CV(Y_i) | Within-paper CV(Y_i) | Across-paper I² | Within-paper I² |
|---|---|---|---|---|---|---|---|
| Micronutrients | Cough prevalence | 0.001 | 0.006 | 1.017 | 3.181 | 0.755 | 1.000 |
| Conditional Cash Transfers | Enrollment rate | 0.009 | 0.027 | 0.790 | 0.968 | 0.998 | 0.682 |
| Conditional Cash Transfers | Unpaid labor | 0.009 | 0.004 | 0.918 | 0.853 | 0.781 | 0.778 |
| Deworming | Hemoglobin | 0.009 | 0.068 | 1.639 | 8.687 | 0.583 | 0.712 |
| Micronutrients | Weight-for-height | 0.010 | 0.005 | 2.252 | * | 0.665 | 0.633 |
| Micronutrients | Birthweight | 0.010 | 0.011 | 0.974 | 0.963 | 0.784 | 0.882 |
| Micronutrients | Weight-for-age | 0.010 | 0.124 | 2.370 | 0.713 | 1.000 | 0.652 |
| School Meals | Height-for-age | 0.011 | 0.000 | 1.086 | * | 0.942 | 0.703 |
| Micronutrients | Height-for-age | 0.012 | 0.042 | 2.474 | 3.751 | 0.993 | 0.508 |
| Unconditional Cash Transfers | Enrollment rate | 0.014 | 0.014 | 1.223 | * | 0.982 | 0.497 |
| SMS Reminders | Treatment adherence | 0.022 | 0.008 | 1.479 | 0.672 | 0.958 | 0.573 |
| Micronutrients | Height | 0.023 | 0.028 | 4.001 | 3.471 | 0.896 | 0.548 |
| Micronutrients | Stunted | 0.024 | 0.059 | 1.085 | 24.373 | 0.348 | 0.149 |
| Micronutrients | Mortality rate | 0.026 | 0.195 | 2.533 | 1.561 | 0.164 | 0.077 |
| Micronutrients | Weight | 0.029 | 0.027 | 2.852 | 0.149 | 0.629 | 0.228 |
| Micronutrients | Fever prevalence | 0.034 | 0.011 | 5.937 | 0.126 | 0.602 | 0.066 |
| Microfinance | Total income | 0.037 | 0.003 | 1.770 | 1.232 | 0.970 | 1.000 |
| Conditional Cash Transfers | Probability unpaid work | 0.046 | 0.386 | 1.419 | 0.408 | 0.989 | 0.517 |
| Conditional Cash Transfers | Attendance rate | 0.046 | 0.018 | 0.591 | 0.526 | 0.988 | 0.313 |
| Deworming | Height | 0.048 | 0.112 | 1.845 | 0.211 | 1.000 | 0.665 |
| Micronutrients | Perinatal deaths | 0.049 | 0.015 | 2.087 | 0.234 | 0.451 | 0.089 |
| Bed Nets | Malaria | 0.052 | 0.047 | 0.650 | 4.093 | 0.967 | 0.551 |
| Scholarships | Enrollment rate | 0.053 | 0.026 | 1.094 | 1.561 | 1.000 | 0.612 |
| Conditional Cash Transfers | Height-for-age | 0.055 | 0.002 | 22.166 | 1.212 | 0.162 | 0.600 |
| HIV/AIDS Education | Used contraceptives | 0.059 | 0.120 | 2.863 | 6.967 | 0.424 | 0.492 |
| Deworming | Weight-for-height | 0.072 | 0.164 | 3.127 | * | 1.000 | 0.907 |
| Deworming | Height-for-age | 0.100 | 0.005 | 2.043 | 1.842 | 1.000 | 0.741 |
| Deworming | Weight-for-age | 0.108 | 0.004 | 2.317 | 1.040 | 1.000 | 0.704 |
| Micronutrients | Diarrhea incidence | 0.135 | 0.016 | 2.844 | 1.741 | 0.922 | 0.807 |
| Micronutrients | Diarrhea prevalence | 0.137 | 0.029 | 1.375 | 3.385 | 0.811 | 0.664 |
| Deworming | Weight | 0.168 | 0.121 | 4.087 | 1.900 | 0.995 | 0.813 |
| Conditional Cash Transfers | Labor force participation | 0.790 | 0.047 | 2.931 | 4.300 | 0.378 | 0.559 |
| Micronutrients | Hemoglobin | 2.650 | 0.176 | 2.982 | 0.731 | 1.000 | 0.996 |

Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across and within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per intervention-outcome-paper, as discussed. Each paper needs to have reported 3 results for an intervention-outcome combination for it to be included in the calculation, in addition to the requirement of there being 3 papers on the intervention-outcome combination. Due to the slightly different sample, the across-paper statistics diverge slightly from those reported in Table 5. Occasionally, within-paper measures of the mean equal or approach zero, making the coefficient of variation undefined or unreasonable; "*" denotes those coefficients of variation that were either undefined or greater than 10,000,000.


Figure 4: Distribution of within and across-paper heterogeneity measures

We can also gauge the magnitudes of these measures by comparison with effect sizes. We know effect sizes are typically considered "small" if they are less than 0.2 SD and that the largest coefficient of variation typically considered in the medical literature is 0.5 (Tian, 2005; Ng, 2014). Taking 0.5 as a very conservative upper bound for a "small" coefficient of variation, this would imply a variance of less than 0.01 for an effect size of 0.2. That the actual mean effect size in the data is closer to 0.1 makes this even more of an upper bound; applying the same reasoning to an effect size of 0.1 would result in the threshold being set at a variance of 0.0025.

Finally, we can try to set bounds more directly, based on the expected prediction error. Here it is immediately apparent that what counts as large or small error depends on the policy question. In some cases, it might not matter if an effect size were mis-predicted by 25%; in others, a prediction error of this magnitude could mean the difference between choosing one program over another or determine whether a program is worthwhile to pursue at all.

Still, if we take the mean effect size within an intervention-outcome to be our "best guess" of how a program will perform and, as an illustrative example, want the prediction error to be less than 25% at least 50% of the time, this would imply a certain cut-off threshold for the variance if we assume that results are normally distributed. Note that the assumption that results are drawn from the same normal distribution, and that the mean and variance of this distribution can be approximated by the mean and variance of observed results, is a simplification for the purpose of a back-of-the-envelope calculation; we would expect results to be drawn from different distributions.

Table 7 summarizes the implied bounds for the variance for the prediction error to be less than 25% and 50%, respectively, alongside the actual variance in results within each intervention-outcome. In only 1 of 51 cases is the true variance in results smaller than the variance implied by the 25% prediction error cut-off threshold, and in 9 other cases it is below the 50% prediction error threshold. In other words, the variance of results within each intervention-outcome would imply a prediction error of more than 50% more than 80% of the time.
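The back-of-the-envelope calculation can be sketched as follows: for a normal distribution centered at the mean effect size, find the largest variance such that a new draw lies within 25% (or 50%) of its realized value at least half the time. Measuring the error relative to the realized draw rather than the mean is an assumption on my part, though it appears broadly consistent with the var25 and var50 columns of Table 7.

```python
from scipy.optimize import brentq
from scipy.stats import norm

def variance_threshold(y_bar, k, coverage=0.5):
    """Largest var such that P(|Y_bar - X| <= k*|X|) >= coverage for X ~ N(y_bar, var)."""
    m = abs(y_bar)

    def prob_minus_target(sd):
        # P(m/(1+k) <= X <= m/(1-k)) for X ~ N(m, sd^2), minus the target coverage
        p = norm.cdf(m * k / ((1 - k) * sd)) + norm.cdf(m * k / ((1 + k) * sd)) - 1
        return p - coverage

    sd_star = brentq(prob_minus_target, 1e-8, 10 * m + 1)  # probability is decreasing in sd
    return sd_star**2

# example mean effect size from Table 7 (bed nets / malaria): 0.342
print(round(variance_threshold(0.342, 0.25), 3), round(variance_threshold(0.342, 0.50), 3))
```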

Table 7: Actual Variance vs. Variance for Prediction Error Thresholds

| Intervention | Outcome | Ȳ_i | var(Y_i) | var25 | var50 |
|---|---|---|---|---|---|
| Microfinance | Assets | 0.003 | 0.000 | 0.000 | 0.000 |
| Rural Electrification | Enrollment rate | 0.176 | 0.001 | 0.005 | 0.027 |
| Micronutrients | Cough prevalence | -0.016 | 0.001 | 0.000 | 0.000 |
| Microfinance | Total income | 0.029 | 0.001 | 0.000 | 0.001 |
| Microfinance | Savings | 0.027 | 0.002 | 0.000 | 0.001 |
| Financial Literacy | Savings | -0.012 | 0.004 | 0.000 | 0.000 |
| Microfinance | Profits | -0.013 | 0.005 | 0.000 | 0.000 |
| Contract Teachers | Test scores | 0.182 | 0.005 | 0.005 | 0.029 |
| Performance Pay | Test scores | 0.131 | 0.006 | 0.003 | 0.015 |
| Micronutrients | Body mass index | 0.125 | 0.007 | 0.002 | 0.014 |
| Conditional Cash Transfers | Unpaid labor | 0.103 | 0.009 | 0.002 | 0.009 |
| Micronutrients | Weight-for-age | 0.050 | 0.009 | 0.000 | 0.002 |
| Micronutrients | Weight-for-height | 0.045 | 0.010 | 0.000 | 0.002 |
| Micronutrients | Birthweight | 0.102 | 0.010 | 0.002 | 0.009 |
| Micronutrients | Height-for-age | 0.044 | 0.012 | 0.000 | 0.002 |
| Conditional Cash Transfers | Test scores | 0.062 | 0.013 | 0.001 | 0.003 |
| Deworming | Hemoglobin | 0.036 | 0.015 | 0.000 | 0.001 |
| Micronutrients | Mid-upper arm circumference | 0.058 | 0.015 | 0.001 | 0.003 |
| Conditional Cash Transfers | Enrollment rate | 0.150 | 0.015 | 0.003 | 0.019 |
| Unconditional Cash Transfers | Enrollment rate | 0.115 | 0.016 | 0.002 | 0.011 |
| Water Treatment | Diarrhea prevalence | 0.145 | 0.020 | 0.003 | 0.018 |
| SMS Reminders | Treatment adherence | 0.088 | 0.022 | 0.001 | 0.007 |
| Conditional Cash Transfers | Labor force participation | 0.092 | 0.023 | 0.001 | 0.007 |
| School Meals | Test scores | 0.117 | 0.023 | 0.002 | 0.012 |
| Micronutrients | Height | 0.035 | 0.023 | 0.000 | 0.001 |
| Micronutrients | Mortality rate | -0.054 | 0.025 | 0.000 | 0.003 |
| Micronutrients | Stunted | 0.143 | 0.025 | 0.003 | 0.018 |
| Bed Nets | Malaria | 0.342 | 0.029 | 0.018 | 0.101 |
| Conditional Cash Transfers | Attendance rate | 0.333 | 0.030 | 0.017 | 0.096 |
| Micronutrients | Weight | 0.068 | 0.034 | 0.001 | 0.004 |
| HIV/AIDS Education | Used contraceptives | 0.061 | 0.036 | 0.001 | 0.003 |
| Micronutrients | Perinatal deaths | -0.093 | 0.038 | 0.001 | 0.008 |
| Deworming | Height | 0.094 | 0.049 | 0.001 | 0.008 |
| Micronutrients | Test scores | 0.134 | 0.052 | 0.003 | 0.016 |
| Micronutrients | Fever prevalence | 0.124 | 0.146 | 0.002 | 0.013 |
| Deworming | Weight | 0.090 | 0.184 | 0.001 | 0.007 |
| Micronutrients | Hemoglobin | 0.322 | 0.215 | 0.016 | 0.090 |
| SMS Reminders | Appointment attendance rate | 0.163 | 0.224 | 0.004 | 0.023 |
| Deworming | Mid-upper arm circumference | 0.373 | 0.439 | 0.021 | 0.121 |
| Conditional Cash Transfers | Probability unpaid work | -0.122 | 0.609 | 0.002 | 0.013 |
| Rural Electrification | Study time | 0.906 | 0.997 | 0.125 | 0.710 |

var25 represents the variance that would result in a 25% prediction error for draws from a normal distribution centered at Ȳ_i; var50 represents the variance that would result in a 50% prediction error.

4.2 With Modelling Heterogeneity

4.2.1 Across Intervention-Outcomes

The results so far have not considered how much of the heterogeneity can be explained. If the heterogeneity can be systematically modelled, it would improve our ability to make predictions. Do results exhibit any variation that is systematic? To begin, I first present some OLS results, looking across different intervention-outcome combinations, to examine whether effect sizes are associated with any characteristics of the program, study, or sample, pooling data from different intervention-outcomes.

As Table 8 indicates, there is some evidence that studies with a smaller number of observations have greater effect sizes than studies based on a larger number of observations. This is what we would expect if specification searching were easier in small datasets; this pattern of results would also be what we would expect if power calculations drove researchers to only proceed with studies with small sample sizes if they believed the program would result in a large effect size, or if larger studies are less well-targeted. Interestingly, government-implemented programs fare worse even controlling for sample size (the dummy variable category left out is "Other-implemented", which mainly consists of collaborations and private sector-implemented interventions). Studies in the Middle East / North Africa region may appear to do slightly better than those in Sub-Saharan Africa (the excluded region category), but not much weight should be put on this as very few studies were conducted in the former region.

While these regressions have the advantage of allowing me to draw on a larger sample of studies and we might think that any patterns observed across so many interventions and outcomes are fairly robust, we might be able to explain more variation if we restrict attention to a particular intervention-outcome combination. I therefore focus on the case of conditional cash transfers (CCTs) and enrollment rates, as this is the intervention-outcome combination that contains the largest number of papers.


Table 8: Regression of Effect Size on Study Characteristics

Columns (1)-(5) are separate specifications; each row lists the coefficient estimates (with standard errors in parentheses) for the columns in which the variable appears.

| Variable | Estimates, b (se) |
|---|---|
| Number of observations (100,000s) | -0.011** (0.00); -0.012*** (0.00); -0.009* (0.00) |
| Government-implemented | -0.107*** (0.04); -0.087** (0.04) |
| Academic/NGO-implemented | -0.055 (0.04); -0.057 (0.05) |
| RCT | 0.038 (0.03) |
| East Asia | -0.003 (0.03) |
| Latin America | 0.012 (0.04) |
| Middle East/North Africa | 0.275** (0.11) |
| South Asia | 0.021 (0.04) |
| Constant | 0.120*** (0.00); 0.180*** (0.03); 0.091*** (0.02); 0.105*** (0.02); 0.177*** (0.03) |
| Observations | 556; 656; 656; 556; 556 |


4.2.2 Within an Intervention-Outcome Combination: The Case of CCTs and Enrollment Rates

The previous results used the across-intervention-outcome data, which were aggregated to one result per intervention-outcome-paper. However, we might think that more variation could be explained by carefully modelling results within a particular intervention-outcome combination. This section provides an example, using the case of conditional cash transfers and enrollment rates, the intervention-outcome combination covered by the most papers.

Suppose we were to try to explain as much variability in outcomes as possible, using sample characteristics. The available variables which might plausibly have a relationship to effect size are: the baseline enrollment rates9; the sample size; whether the study was done in a rural or urban setting, or both; results for other programs in the same region10; and the age and gender of the sample under consideration.

Table 9 shows the results of OLS regressions of the effect size on these variables, in turn. The baseline enrollment rates show the strongest relationship to effect size, as reflected in the R² and significance levels: it is easier to have large gains where initial rates are low. Some papers pay particular attention to those children that were not enrolled at baseline or that were enrolled at baseline. These are coded as a "0%" or "100%" enrollment rate at baseline but are also represented by two dummy variables. Larger studies and studies done in urban areas also tend to find smaller effect sizes than smaller studies or studies done in rural or mixed urban/rural areas. Finally, for each result I calculate the mean result in the same region, excluding results from the program in question. Results do appear slightly correlated across different programs in the same region.
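As an illustration of that last covariate, here is a sketch of computing, for each result, the mean effect size among programs in the same region other than the program in question; the data frame, column names, and values are hypothetical.

```python
import pandas as pd

def leave_one_out_regional_mean(df, region_col="region", program_col="program", es_col="effect_size"):
    """For each row, mean effect size of results in the same region from *other* programs."""
    out = []
    for _, row in df.iterrows():
        mask = (df[region_col] == row[region_col]) & (df[program_col] != row[program_col])
        out.append(df.loc[mask, es_col].mean())  # NaN if no other program in the region
    return pd.Series(out, index=df.index, name="regional_mean_excl_own")

df = pd.DataFrame({
    "program": ["PROGRESA", "PROGRESA", "Bolsa Familia", "Familias en Accion", "Pantawid"],
    "region": ["Latin America", "Latin America", "Latin America", "Latin America", "East Asia"],
    "effect_size": [0.18, 0.12, 0.05, 0.09, 0.11],
})
print(df.assign(regional_mean_excl_own=leave_one_out_regional_mean(df)))
```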

9In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn, the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group

10Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia, following the World Bank’s geographical divisions.


Table 9: Regression of Projects' Effect Sizes on Characteristics (CCTs on Enrollment Rates)

Columns (1)-(10) are separate specifications; each row lists the coefficient estimates (with standard errors in parentheses) for the columns in which the variable appears.

| Variable | Estimates, b (se) |
|---|---|
| Enrollment Rates | -0.224*** (0.05); -0.092 (0.06); -0.127*** (0.02) |
| Enrolled at Baseline | -0.002 (0.02) |
| Not Enrolled at Baseline | 0.183*** (0.05); 0.142*** (0.03) |
| Number of Observations (100,000s) | -0.011* (0.01); -0.002 (0.00) |
| Rural | 0.049** (0.02); 0.002 (0.02) |
| Urban | -0.068*** (0.02); -0.039** (0.02) |
| Girls | -0.002 (0.03) |

References
