Just found out about an important line of research, pushed forward by Berkeley PhD Eva Vivalt, on how much we can generalize from impact evaluations. The deeper question is that, if we want to do science, we have to look for invariances and laws, and that involves abstracting away from the particularities of each experiment to keep what remains standing. In other words, as in physics and the other ‘hard sciences’, we need mounting evidence to start challenging our theories and building new ones about how the world works. And for that we need to compare the different causal effects found at specific points in spacetime.

Here is the abstract:

“Impact evaluations aim to predict the future, but they are rooted in particular contexts and results may not generalize across settings. I founded an organization to systematically collect and synthesize impact evaluation results on a wide variety of interventions in development. These data allow me to answer this and other questions for the first time using a large data set of studies. I examine whether results predict each other and whether variance in results can be explained by program characteristics, such as who is implementing them, where they are being implemented, the scale of the program, and what methods are used. I find that when regressing an estimate on the hierarchical Bayesian meta-analysis result formed from all other studies on the same intervention-outcome combination, the result is significant with a coefficient of 0.5-0.7, though the R2 is very low. The program implementer is the main source of heterogeneity in results, with government-implemented programs faring worse than and being poorly predicted by the smaller studies typically implemented by academic/NGO research teams, even controlling for sample size. I then turn to examine specification searching and publication bias, issues which could affect generalizability and are also important for research credibility. I demonstrate that these biases are quite small; nevertheless, to address them, I discuss a mathematical correction that could be applied before showing that randomized controlled trials (RCTs) are less prone to this type of bias and exploiting them as a robustness check.”
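To make the exercise in the abstract concrete, here is a minimal sketch (not Vivalt's actual code or data, and all numbers are made up) of the kind of leave-one-out check she describes: pool all the other studies on the same intervention-outcome combination into a meta-analytic estimate, use that as the prediction for the held-out study, and regress the actual estimates on those predictions. For simplicity it uses a DerSimonian-Laird random-effects pooling as a stand-in for her full hierarchical Bayesian model.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate via the DerSimonian-Laird method."""
    w = 1.0 / variances                       # fixed-effect (inverse-variance) weights
    mu_fe = np.sum(w * effects) / np.sum(w)   # fixed-effect mean
    q = np.sum(w * (effects - mu_fe) ** 2)    # Cochran's Q heterogeneity statistic
    df = len(effects) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)             # estimated between-study variance
    w_re = 1.0 / (variances + tau2)           # random-effects weights
    return np.sum(w_re * effects) / np.sum(w_re)

# Simulate several intervention-outcome combinations, each with its own
# average effect and several noisy, heterogeneous studies (all hypothetical).
rng = np.random.default_rng(0)
n_combos, studies_per = 30, 8
true_means = rng.normal(0.2, 0.3, n_combos)

estimates, loo_predictions = [], []
for k in range(n_combos):
    var = rng.uniform(0.01, 0.05, studies_per)            # sampling variances
    eff = rng.normal(true_means[k], np.sqrt(var + 0.04))  # extra heterogeneity
    for i in range(studies_per):
        estimates.append(eff[i])
        # Predict study i from a meta-analysis of all *other* studies
        # on the same intervention-outcome combination.
        loo_predictions.append(
            dersimonian_laird(np.delete(eff, i), np.delete(var, i))
        )

# Regress each estimate on its leave-one-out meta-analytic prediction.
slope, intercept = np.polyfit(loo_predictions, estimates, 1)
print(f"regression coefficient: {slope:.2f}")
```

In this simulation the slope falls below one because within-combination heterogeneity and sampling noise dilute the prediction, which is the same mechanism behind a coefficient like the 0.5-0.7 reported in the abstract (the simulated value itself is arbitrary).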

And here is the link to the working paper and to her blog, which is also a worthy read.