
There’s No Escape from External Validity – Reporting Habits of Randomized Controlled Trials

Jörg Peters, Jörg Langbein, and Gareth Roberts

Version March 13, 2015 – Please contact the authors for an updated version before citing

Abstract:

Randomized controlled trials (RCTs) are considered the gold standard in the empirical social sciences and have been used increasingly in recent years. While their internal validity is in most cases beyond discussion, RCTs still need to establish external validity. External validity is the crucial determinant of a study’s policy relevance and might be at stake because of potential general equilibrium effects, Hawthorne effects, or representativeness problems that compromise generalizing results beyond the studied population. For this paper, we reviewed all RCTs published in leading economic journals between 2009 and 2012 and scrutinized them for the way in which they treat external validity. Based on a set of objective indicators, we find that the RCT literature does not adequately account for potential hazards to external validity.

A large share of published RCTs neither discusses potential limitations to external validity nor provides the information necessary to assess potential problems. We conclude by calling for a more systematic approach to designing RCTs and to reporting their results.

Keywords: systematic review, internal validity, external validity, randomized controlled trials.

JEL Classification: C83, C93

1 Jörg Peters, RWI and AMERU, University of the Witwatersrand, Johannesburg, South Africa; Jörg Langbein, RWI; Gareth Roberts, AMERU, University of the Witwatersrand, Johannesburg, South Africa. All correspondence to: Jörg Peters, RWI, Hohenzollernstraße 1-3, 45128 Essen, Germany, e-mail: peters@rwi-essen.de, phone: ++49-201-8149-247.

We are grateful for valuable comments and suggestions by Marc Andor, Michael Grimm, Stephan Klasen as well as seminar participants at the University of Passau and the University of Göttingen.


1. Introduction

Most of the researcher’s energy in empirical social sciences is – for good reasons – absorbed by endeavours to ensure the internal validity of her study. In a nutshell, internal validity is achieved if the observed effect is indeed a causal one. Hence, internal validity is the necessary condition for a study to have any policy relevance.

External validity prevails if the study’s findings can be transferred from the study population to the policy population. Thus, conditional on internal validity, the external validity of an empirical study is nothing less than the sufficient condition for its policy relevance.

The pertinence of internal and external validity of empirical research has obviously grown with the increasing number of empirical research studies. The share of empirical papers based on micro-data published in the economic top outlets has risen constantly over the last 30 years. One method stands out in terms of methodological rigor: randomized controlled trials (RCTs). RCTs are experimental studies that are implemented not in the laboratory but in the field and, hence, under real-world conditions. They captivate economists’ hearts by a striking internal validity: self-selection into treatment, which has long been the nightmare of any empirical researcher, is no longer a problem due to the randomized assignment of the treatment.

The beauty of RCTs, their high internal validity, is frequently contrasted with shortcomings in external validity. Many critics state that establishing external validity is in many cases more difficult for RCTs than for studies based on observational data (MOFFITT 2004, TEMPLE 2010). The reason for this concern is that RCTs can mostly be conducted in a limited region and monitored over a few years only, whereas observational studies can use panel data that cover many years and whole countries or more. Furthermore, the controlled and experimental character of RCTs is suspected to create an environment from which findings cannot be readily transferred to non-study set-ups. In particular, to the extent that participants in an RCT are aware of their participation in an experiment, they can be expected to behave differently than they would under real-world conditions.

These concerns about external validity are well-known and have been widely discussed. A very prominent criticism has been brought forward by Dani Rodrik (RODRIK 2009). He states that external validity is never established by the study itself and that RCTs often leave parameters that co-determine the results behind a veil because of their limited scope in terms of the population being studied and the specific experimental conditions under which they are conducted. For observational studies it might be easier to cover a broader scope. Therefore, he argues, RCTs require “credibility-enhancing arguments” on the external validity side – just as observational studies have to argue on the internal validity side.

Already in 2005, during the symposium on “New directions in development economics: Theory or empirics?”, Kaushik Basu brings up very fundamental arguments calling for caution in interpreting results from empirical studies with (naturally) limited scope. He claims that any generalization requires adding “unscientific intuition” to the statistical findings (BASU 2005). In the symposium’s response to this, Abhijit Banerjee, one of the most prominent proponents of RCTs, acknowledges the requirement to establish external validity for RCTs (BANERJEE 2005). He explicitly stresses potential limitations in transferring experimental findings from one region to another. In addition, Banerjee emphasizes the threat of general equilibrium effects for the external validity of some RCTs. Here, he calls for a theoretical framework that allows for out-of-sample predictions under certain assumptions. This theoretical framework would thereby offer the “credibility-enhancing arguments” that Rodrik called for. To conclude, Banerjee, Basu and Rodrik seem to agree that external validity is generally never a self-evident fact in empirical research, and RCTs in particular have to discuss the extent to which their results are generalizable.

Against this background, we conduct a systematic review that examines the extent to which papers published in top economic journals discuss their external validity and thereby follow the recommendations of Banerjee and Rodrik. What is the practice in conducting RCTs and in reporting the results? We reviewed all RCT-based papers published between 2009 and 2012 in the American Economic Review, the Quarterly Journal of Economics, Econometrica, the Journal of Public Economics, the Economic Journal, and the American Economic Journal: Applied Economics. In total, we included 46 RCT-based papers published in these journals and scrutinized them with regard to the different dimensions affecting external validity.

As a basis for our review, we use the comprehensive presentation of external validity and its different dimensions in the seminal toolkit for the implementation of RCTs by Esther Duflo, Rachel Glennerster and Michael Kremer – all three leading scholars in the field (DUFLO, GLENNERSTER AND KREMER 2008). They present three dimensions of hazards to external validity: general equilibrium effects, Hawthorne and John Henry effects, and problems of generalizability beyond specific programs and samples. Along the lines of these three dimensions we formulate nine questions that we asked of each of the 46 papers. All questions can be objectively answered by “yes” or “no”.

In the remainder of the paper we first present the concept of external validity and its different dimensions in more detail (Section 2), before the methodological approach and the nine questions are discussed (Section 3). The results are presented in Section 4.

Section 5 concludes.

2. Hazards to External Validity

In order to guide the introduction of the different dimensions of external validity we start by an example that exposes the potential hazards in a very stylized way.

Suppose you have 12-year-old twins, and suppose both are the same sex – say, boys.

You now give one of them 10 Dollars; the other gets nothing. The lucky recipient can do with the money whatever he wants. Assume he buys some candies, some football cards and airtime for his cell phone. His twin brother only gets some candies, since he cannot afford the football cards and the airtime. What would this observation tell us about giving an untied 10 Dollars to other people in general? The answer first depends on why you gave the 10 Dollars to the lucky recipient and not to his brother. Maybe you selected him because you expect him to use the money more responsibly. Or he might have been the first to raise his hand when you offered the 10 Dollars. In either case, comparing the two is an apples-and-oranges comparison; the observation is internally not valid.

If, in contrast, you throw a coin and let fate decide, the observed consumption patterns might tell you quite a bit – at least if the other people whose consumption behavior you are interested in are also 12 years old. Would we expect the consumption patterns to change if the 10 Dollars are given to adults? Or to girls? For sure we would. Now assume we give the 10 Dollars only to 12-year-old boys – but to every single 12-year-old boy in the country. Would we expect the consumption patterns to change? Yes, some candies and certainly football cards might run short. Thus, some boys might not be able to obtain football cards or certain candies – or their prices might rise.
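The coin-flip logic can be illustrated with a small, purely hypothetical simulation (all numbers invented): when the 10 Dollars go to whoever is most eager, the two groups differ systematically before any money changes hands; when a coin flip decides, the groups are comparable on average, so differences in spending can be attributed to the transfer.

```python
import random

# Stylized, hypothetical illustration of the twin example: a pre-existing
# trait (invented numbers) drives both eagerness to grab the money and,
# potentially, how it is later spent.
random.seed(1)
boys = [random.gauss(0.5, 0.1) for _ in range(10_000)]

# Self-selection: the most eager half gets the 10 Dollars first.
ranked = sorted(boys)
not_selected, selected = ranked[:5_000], ranked[5_000:]

# Randomization: a coin flip decides instead.
flips = [random.random() < 0.5 for _ in boys]
treated = [b for b, f in zip(boys, flips) if f]
untreated = [b for b, f in zip(boys, flips) if not f]

mean = lambda xs: sum(xs) / len(xs)
# Self-selected groups differ systematically (apples and oranges) ...
print(round(mean(selected) - mean(not_selected), 3))
# ... while coin-flip groups are comparable on average.
print(round(mean(treated) - mean(untreated), 3))
```

This is only the internal-validity half of the story; the external-validity hazards discussed next arise even when the coin flip is done perfectly.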

Let’s go one step back and re-conduct the game, but let’s assume now that we explicitly inform both boys that we are running a 10-Dollar lottery and observe their subsequent behavior. Will the outcome be the same? Probably not. The lucky recipient might now act more in line with what he expects you to expect from him. He might buy some stationery, abstain from buying the candies, stick to the football cards (in case he knows that you like football) and even save a bit for a later investment (in case he deems you to appreciate far-sightedness). Any repetition of the exercise at a larger scale, in which the recipients are not observed, will yield different consumption patterns than the one observed (even if only 12-year-old boys get money).

While this example is admittedly a very stylized one, it reveals the fragility of the logical chain between internal validity and policy relevance. It is used in the following to visualize the first comprehensive presentation of external validity in the literature, provided in the toolkit on how to implement RCTs by DUFLO, GLENNERSTER, AND KREMER (2008). They introduce external validity as the question “whether the impact we measure would carry over to other samples or populations. In other words, whether the results are generalizable and replicable”. Duflo, Glennerster and Kremer (DGK in the following) present three dimensions of hazards to external validity.

The first dimension arises from potential general equilibrium effects (GEE). Such GEE occur if the program is upscaled to a broader population and extended over a longer term. To be precise, GEE only become noticeable if the program is upscaled, since in the non-upscaled case the effects are too small. In the twin example provided above, GEE occur if many kids receive the 10-Dollar payment and some of the goods kids want to buy become scarcer and, thus, more expensive. DGK rightly point out that some sort of GEE will always occur, irrespective of the scope of the study. Even if a study examines a whole country, GEE might occur at the world level. Nonetheless, the severity of GEE depends on some parameters, most notably the size of the area included in the RCT and the impact indicators the study looks at. For market-based outcomes like wages or employment status, GEE can often be expected to be much more pronounced than for non-market outcomes like immunization in a vaccination program or educational outcomes (if these are considered a means to an end, not as labor market outcomes). Thus, the hazard that GEE constitute for the external validity of a study is not always the same, and a profound discussion of the GEE-relevant features can provide Rodrik’s “credibility-enhancing arguments”.

While GEE in principle also threaten the external validity of observational studies, the next dimension is a particularity of experimental research: Hawthorne and John Henry effects might occur if the participants in an RCT know or notice that they are participating in an experiment and are under observation. It is obvious that this could lead to altered behavior in the treatment group (Hawthorne effect) and/or the control group (John Henry effect). In the twin example above, the receiver of the 10 Dollars can be expected to spend the money differently in case he knows that his mother is observing him. It is also obvious that such behavioral responses due to the mere presence of the experiment can be expected to differ between experimental set-ups. If the experiment is embedded in a business-as-usual set-up, as in ANDERSON AND SIMESTER (2010), who send out catalogues without mentioning an experiment or a study related to the catalogues to the recipients, distortions of participants’ behavior are quite unlikely. In contrast, if the randomized intervention clearly interferes with the participants’ daily life (for example, an NGO appearing in an African village to randomize a certain training among the villagers), intuition clearly suggests that participants act differently than they would under non-experimental conditions. Hence, furnishing the reader with detailed information on how the RCT was implemented and how participants were approached, along with qualitative or quantitative evidence on how the experiment was perceived, can provide Rodrik’s “credibility-enhancing arguments”.

The third hazard to external validity DGK discuss is problems of generalizability. DGK distinguish three sources: First, the treatment might be provided with special care in the RCT, which makes the treatment different from what would be done in an upscaled program. In the twin example, an upscaled lump-sum payment would maybe not be provided by the mother but by a more neutral person, which could of course affect the subsequent consumption pattern. Second, the specific-sample problem is induced by a study population that differs from the policy population in which the intervention would take place in case of an upscaling. The response of the winning twin can be expected to be different if the same experiment is conducted among 12-year-old girls, among adults, or in a different country. Third, the similar-but-not-identical-program problem arises if the treatment in the real world deviates slightly from the treatment that is provided for the RCT. If a policy maker wants to learn how 12-year-old boys react to a lump-sum payment of 20 Dollars instead of 10 Dollars, results might be different.

In the next section, these dimensions of external validity are translated into indicators to be applied during the review of published RCTs.


3. Methods and Data

3.1. Review approach

We reviewed all RCTs published between 2009 and 2012 in a selection of the leading journals in the field. We included the five most important economic journals, namely the American Economic Review, Econometrica, the Quarterly Journal of Economics, the Journal of Political Economy and the Review of Economic Studies.1 In addition to these Top-5 journals, we included further leading general interest journals that publish empirical work and RCTs in particular: The Economic Journal, the Journal of Public Economics, and the American Economic Journal: Applied Economics.

We scrutinized all issues in the period and identified those papers that mention the terms “experiment”, “field experiment”, “randomized controlled trials” or “experimental evidence” in either the title or the abstract, and thereby identified 57 papers. We furthermore used the taxonomy by HARRISON AND LIST (2004) to identify RCTs that intend to evaluate a policy intervention. Lab experiments and what Harrison and List classify as “artefactual field experiments” are mostly used to test parameters of economic behavior (and not a certain policy) and are therefore excluded from this review. We concentrate on what Harrison and List refer to as “framed field experiments” and “natural field experiments”, since they correspond most closely to the common understanding of RCTs and their policy relevance. In spite of the fine-grained taxonomy provided by Harrison and List, published RCTs hardly classify themselves along these lines. In the vast majority of cases the demarcation was nonetheless very obvious, and we excluded 11 papers because we classified them as artefactual experiments.2 In total, we found 46 papers that qualify as RCTs in Harrison and List’s taxonomy. The distribution across journals is uneven, with the vast majority being published in the American Economic Journal: Applied Economics, the American Economic Review and the Quarterly Journal of Economics (see Figure 1).

1 No RCT was found in the Journal of Political Economy and the Review of Economic Studies between 2009 and 2012.

2 A comprehensive list of both included and excluded papers can be obtained from the authors.


Figure 1: Published RCTs between 2009 and 2012

Each paper was asked nine objective questions related to external validity and its three dimensions outlined in Section 2, each of which can be answered by either “yes” or “no”. These objective questions simply examine whether the claim of Dani Rodrik and Abhijit Banerjee is fulfilled and the plausibility of external validity is established.

“Credibility-enhancing arguments” require a discussion of the constituting dimensions of external validity. While there might not be a common understanding of what constitutes external validity, Esther Duflo, Rachel Glennerster and Michael Kremer are leading scholars in the field, and we hence assume that most researchers who conduct RCTs are aware of their toolkit published in the Handbook of Development Economics.

We abstained from applying subjective ratings in order to avoid room for arbitrariness.

One might argue, though, that a discussion of external validity is only required if it is obviously at stake. Some RCT designs are more prone to external validity hazards than others. For Hawthorne effects, we account for this by additionally asking one subjective question of a random subset of the reviewed papers that assesses the degree to which a paper is prone to such effects.

[Figure 1 is a bar chart of the number of published RCTs per journal (horizontal axis: 0 to 20 papers): American Economic Review, AEJ: Applied Economics, Quarterly Journal of Economics, Journal of Public Economics, The Economic Journal, Econometrica.]


3.2. Nine questions

In order to elicit the extent to which Hawthorne and John Henry effects are accounted for we asked the following objective questions:

1. Does the paper explicitly mention the term “Hawthorne effect” or “John Henry effect”?

2. Does the paper explicitly say if participants are aware of being part of an experiment or a study?

“Credibility-enhancing arguments” against potential Hawthorne- or John-Henry effects in an RCT require at least a proper description of how the experiment was presented to the participants. Obviously, the information whether participants know that they are participating in an experiment is crucial for the reader to assess whether Hawthorne and John-Henry effects might occur.

For those papers that explicitly state that people are aware of being part of an experiment we additionally raise the question:

3. Does the paper (try to) account for Hawthorne or John Henry effects (in the design of the study, in the interpretation of results, or in the size of the effect)?

Furthermore, a subjective question was asked of a random subset of reviewed papers: Is the study prone to Hawthorne and/or John Henry effects? While this question is clearly subjective, it is based on specified criteria that try to grasp the extent to which participants might notice that they are part of an experiment (in case they are not told) and the extent to which participants’ behavior might be affected by the awareness of being part of an experiment. The underlying questions are: What was the level of randomization (e.g. region, school/hospital, individual household)? How were participating units contacted? Was the data collected by means of a survey or business-as-usual reporting? Who conducted the survey? Was a baseline conducted that was perceivable as a baseline? Were participants monitored while the treatment was still ongoing or while the effects were still supposed to unfold? Furthermore, in answering this subjective question we also account for potential reporting biases (e.g. gratitude bias) in the participants’ answers that might aggravate Hawthorne and John Henry effects. We provide a paper-specific narrative assessment of the randomly picked subset of reviewed papers in the Electronic Appendix.3

The next set of questions probes into general equilibrium effects. As outlined in Section 2, we define general equilibrium effects as changes due to an intervention that occur in a noticeable way only after a longer time period or if the intervention is upscaled.

We reviewed the papers asking the following question:

4. Does the paper explicitly mention the term general equilibrium effects?

The term is widely used in the academic discussion about the external validity of empirical research and RCTs in particular. It is therefore quite likely that a paper that accounts for general equilibrium effects also uses this expression. The next question probes into the time-wise dimension of general equilibrium effects:

5. Does the paper explicitly discuss what might happen in the long run or what might happen if the program is upscaled?

We give the answer “yes” as soon as this issue is mentioned in the paper, irrespective of whether we consider the discussion to be comprehensive. The indicator is thus applied in a very conservative way.

The third dimension is what DGK call “Generalizing beyond Specific Programs and Samples”. Also in line with DGK, we capture this dimension by scrutinizing the papers for whether they were conducted in a specific sample from which it is difficult to transfer the findings to other settings and the non-RCT world, and for the level of special care that was dedicated to the treatment because it was part of an RCT.

In particular, we pursue the following questions:

6. Does the paper explicitly mention the term transferability or generalizability?

7. Does the paper discuss the representativeness of the study population for the policy population?

In most cases, an RCT studies a sub-population of the population for which it wants to derive policy implications. For example, a certain region of a country might be included in an RCT, but the intervention under scrutiny is supposed to be applied to the whole country or even beyond.

The special-care sub-dimension defined by DGK is accounted for by the following question:

8. Does the paper explicitly discuss the extent of special care that is used in implementing the intervention (in demarcation to the real world intervention)?

For this question, we did not expect papers to use the term “special care”, since it is not as broadly used as general equilibrium effects or Hawthorne and John Henry effects. We rather examined whether certain particularities of how the randomized treatment was provided are discussed, or whether the paper states (explicitly) that care was taken to conduct the treatment in a way that is close to the real-world case. As soon as such statements are made, we answered the question with “yes”, irrespective of whether we personally deem the statement to be comprehensive.


BOLD ET AL. (2013) provide compelling evidence for the special-care effect in an RCT that was scaled up based on positive effects observed in a smaller RCT conducted by DUFLO, DUPAS, AND KREMER (2012). The major difference was that the scaled-up program examined in Bold et al. was implemented by the national government, whereas the smaller one examined by Duflo, Dupas and Kremer had been implemented by an NGO. The positive results could not be replicated. According to the authors, these “results suggest that scaling-up an intervention (typically defined at the school, clinic, or village level) found to work in a randomized trial run by a specific organization (often an NGO chosen for its organizational efficiency) requires an understanding of the whole delivery chain. If this delivery chain involves a government Ministry with limited implementation capacity or which is subject to considerable political pressures, agents may respond differently than they would to an NGO-led experiment”. We therefore elicit for every paper we review:

9. Who is the implementation partner of the RCT?
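The review procedure described above – nine yes/no questions per paper, then the share of “no” answers per question – can be sketched as a simple tabulation. The paper identifiers and codings below are invented placeholders for illustration, not the actual review data:

```python
# Hypothetical sketch of the review tabulation: each paper is coded
# "yes"/"no" on the nine external-validity questions, and the share of
# "no" answers is reported per question. All data here are placeholders.

QUESTIONS = [
    "Hawthorne/John Henry effect mentioned?",
    "Participant awareness stated?",
    "Accounts for Hawthorne/John Henry effects?",
    "General equilibrium effects mentioned?",
    "Long run discussed?",
    "Upscalability discussed?",
    "Transferability/generalizability mentioned?",
    "Representativeness discussed?",
    "Special care discussed?",
]

# answers[paper] = list of "yes"/"no"/None; None marks a question that does
# not apply (e.g. question 3 applies only when awareness is stated).
answers = {
    "paper_A": ["no", "yes", "yes", "no", "no", "no", "yes", "yes", "no"],
    "paper_B": ["no", "no", None, "no", "yes", "no", "no", "no", "no"],
    "paper_C": ["yes", "yes", "no", "no", "no", "yes", "yes", "no", "no"],
}

def percent_no(answers, q_index):
    """Share of 'no' answers among papers where the question applies."""
    applicable = [a[q_index] for a in answers.values() if a[q_index] is not None]
    return round(100 * applicable.count("no") / len(applicable))

for i, question in enumerate(QUESTIONS):
    print(f"{question} {percent_no(answers, i)}% no")
```

The denominator shrinks for conditional questions (here, question 3), which is why the table in Section 4 reports a different number of applicable studies per question.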

4. Results

Our review has shown that virtually none of the reviewed papers addresses external validity comprehensively as it is presented in DUFLO, GLENNERSTER AND KREMER (2008). Table 1 shows the results of our systematic review for the different dimensions of external validity according to the questions presented in the previous section. For general equilibrium effects and Hawthorne/John Henry effects, the vast majority of published papers does not mention the terms, discuss related implications, or account for potential problems. For Hawthorne/John Henry effects it is particularly striking that more than 40 percent of the published papers do not mention whether people are aware of being part of an experiment – which is of course of decisive importance for the reader to assess whether distorting behavioral responses might occur. For obvious reasons, many studies only look at effects in the short or mid-term. While this is in most cases probably inevitable, a discussion of long-term implications would still be valuable; yet most papers do not mention or discuss implications for long-term effects at all. Likewise, the upscalability of the intervention is hardly discussed, although upscaling the randomized intervention should be the goal (at least in case of positive findings).

Table 1: Reporting habits of published RCTs

Question                                                      Applies to    Percent with No

Hawthorne and John Henry effects
1. Hawthorne or John Henry effect explicitly mentioned?       44 papers     91
2. Does the paper explicitly say if participants are
   aware of being part of an experiment or a study?           44 studies    59
3. Does the paper (try to) account for Hawthorne or
   John Henry effects?                                        13 studies    62

General equilibrium effects
4. General equilibrium effects mentioned?                     44 studies    82
5. Discusses the long run?                                    44 studies*   80
6. Discusses upscalability?                                   44 studies*   84

Transferability and generalizability
7. Transferability or generalizability mentioned?             44 studies    40
8. Representativeness of study population discussed?          44 studies    43
9. Special care discussed?                                    44 studies    77

The majority of published papers mentions and discusses the generalizability of findings beyond the study population from a representativeness perspective. In fact, external validity is frequently reduced to this dimension of transferability/generalizability (our questions 7 and 8). Still, the fact that 40 percent of papers do not make any statement on the extent to which the studied population is representative of a certain policy population is surprising. As the results for Question 9 show, potential implications of special care are barely discussed.

The implementation partners (question 9 in Section 3.2) are shown in Figure 2. Around one half of the published RCTs were implemented by either a large firm or a governmental body – which most closely resembles the natural business-as-usual situation. Slightly more than 50 percent of RCTs were implemented by either the researchers themselves or an NGO.

Figure 2: Implementation partners of published RCTs (30 studies included) – National Government / Big Firms: 37%; Regional Government: 10%; Non-Governmental Organisations: 30%; Researcher: 23%

5. Conclusion

In theory there seems to be a consensus among empirical researchers that establishing the external validity of a study is as important as establishing its internal validity. Against this background, this paper has systematically reviewed the existing RCT literature in order to examine the extent to which this happens in practice, i.e. whether external validity concerns are addressed in the practice of conducting field experiments. We have reviewed all papers based on RCTs published in the leading economics journals and found that hardly any of them comprehensively addresses the different dimensions of external validity as summarized by DUFLO, GLENNERSTER AND KREMER (2008) in their toolkit for the implementation of RCTs. Many published RCTs do not provide a comprehensive presentation of how the experiment was implemented. What is particularly striking is that almost half of the papers do not state whether the participants in the experiment are aware of being part of an experiment – which is obviously important information for the reader to judge whether distorting Hawthorne or John Henry effects might occur. General equilibrium effects, too, are only rarely addressed. The majority of published papers does discuss the question of whether the population in which the study was implemented can serve as an appropriate representation of the policy population in which the intervention would be implemented in the real world. It is important to emphasize that we answered all questions quite conservatively. The questions were answered with “yes” even if the respective paper contains only a brief discussion of a certain hazard. We do not rate the profoundness of the discussion, for example whether additional evidence is provided.

It is by no means the purpose of this paper to argue against RCTs. Virtually all of the papers we have reviewed are excellent studies. We have learned, though, that there is wide variation among the reviewed papers in terms of their external validity. Some are beyond discussion in terms of Hawthorne effects and generalizability – for example because they study a whole country and because there is hardly any perceivable contact with the participants. In other cases, strong doubts prevail, for example because only small excerpts of the policy population are studied and there is intense contact between the field researchers and the participants. There would nonetheless be ways to reduce Hawthorne effects in the study design and to extrapolate from the study population to the policy population, but the required assumptions and limitations are left behind a veil in many papers. It might even be the case that most studies we reviewed account for these concerns in a convincing way. In our view, it is unfortunate that the assumptions the researchers make on the different dimensions of external validity are not discussed openly in the paper (or an appendix).

We therefore call for dedicating the same effort to establishing external validity as is dedicated to internal validity. The extent to which the different hazards to external validity apply to a specific study should be discussed transparently, in line with what has been documented in toolkits and methodological discussions. The role model for this could be the CONSORT statement in the medical literature. Ultimately, the conclusion of LEVITT AND LIST (2007) on lab experiments also applies to RCTs: “By anticipating the types of biases common to the lab, experiments can be designed to minimize such biases. Further, knowing the sign and plausible magnitude of any biases induced by the lab, one can extract useful information from a study, even if the results cannot be seamlessly extrapolated outside the lab.”


References

Anderson, E. and Simester, D. (2010). 'Price stickiness and consumer antagonism', Quarterly Journal of Economics, vol. 125(2), pp. 729-65.

Banerjee, A., Bardhan, P., Basu, K., Kanbur, R. and Mookherjee, D. (2005). 'New directions in development economics: theory or empirics?', BREAD Working Paper No. 106, A Symposium in Economic and Political Weekly.

Bold, T., Kimenyi, M., Mwabu, G., Ng'ang'a, A. and Sandefur, J. (2013). 'Scaling up what works: experimental evidence on external validity in Kenyan education', Center for Global Development Working Paper Series, No. 321.

Cohen, J. and Dupas, P. (2010). 'Free distribution or cost-sharing? Evidence from a randomized malaria prevention experiment', Quarterly Journal of Economics, vol. 125(1), pp. 1-45.

Duflo, E., Glennerster, R. and Kremer, M. (2008). ‘Using randomization in development economics research: a toolkit’, in (P. Schultz and J. Strauss, eds.), Handbook of Development Economics, pp. 3895-962, Amsterdam: North Holland.

Harrison, G.W. and List, J.A. (2004). 'Field experiments', Journal of Economic Literature, vol. 42(4), pp. 1009-55.

Levitt, S.D. and List, J.A. (2007). 'What do laboratory experiments measuring social preferences reveal about the real world?', Journal of Economic Perspectives, vol. 21(2), pp. 153-74.

Moffitt, R. (2004). 'The role of randomized field trials in social science research', American Behavioral Scientist, vol. 47(5), pp. 506-40.

Muralidharan, K. and Sundararaman, V. (2010). 'The impact of diagnostic feedback to teachers on student learning: experimental evidence from India', Economic Journal, vol. 120(546), pp. 187-203.

Rodrik, D. (2009). 'The new development economics: we shall experiment, but how shall we learn?', in (W. Easterly and J. Cohen, eds.), What Works in Development? Thinking Big and Thinking Small, pp. 24-54, Washington, DC: Brookings Institution Press.

Temple, J.R.W. (2010). 'Aid and conditionality', in Handbook of Development Economics, pp. 4417-511, Amsterdam: North Holland.


Zwane, A.P., Zinman, J., Van Dusen, E., Pariente, W., Null, C., Miguel, E., Kremer, M., Karlan, D.S., Hornbeck, R., Giné, X., Duflo, E., Devoto, F., Crepon, B. and Banerjee, A. (2011). 'Being surveyed can change later behavior and related parameter estimates', Proceedings of the National Academy of Sciences, vol. 108(5), pp. 1821-26.
