Randomized control trials (RCTs) are all the rage at the moment in economics and, increasingly, in the analysis and even formation of public policy. In the world of development economics, in particular, the turn away from conventional empirical methods (the analysis of existing data sets using classical statistical techniques) towards designing RCTs has been pioneered by, among others, the Indian-born Abhijit Banerjee and the French-born Esther Duflo, both professors at the Massachusetts Institute of Technology.
Their widely acclaimed and successful book, Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty (Random House India, 2011), drawing on examples from throughout the developing world—including India—galvanized a wholesale embrace of their methodology.
This has subsequently spawned a veritable cottage industry of research papers by the two authors themselves, by associated scholars and students, and by a clutch of followers. This is to say nothing of other practitioners of the RCT methodology in areas unconnected to their work, reflecting a wider embrace of this new approach by the economics profession at large.
Most recently, RCTs have ventured beyond their traditional domain of development economics and other areas of applied microeconomics, and have begun to colonize a larger swathe of economics and public policy discussions. A widely commented upon leader in The Economist, “In praise of human guinea pigs” (12 December 2015), argues quite aggressively for the widespread use of RCTs in the evaluation and formation of public policy in advanced economies, much as they are already being used in developing and emerging economies.
But what exactly is this putative RCT revolution hailed by Banerjee and Duflo, which they claim grandly in the subtitle of their book will lead to a radical rethinking of the fight against poverty?
Put simply, it takes a leaf from the natural sciences, medicine in particular, and assesses the effect of a particular treatment. This is done by separating a population of subjects into those who receive the treatment in question and those who remain a control group. Individuals are randomly assigned to one or the other group.
Thus, in theory, any other differences between them should cancel out, on average. Given that the two groups are identical, or nearly identical, in every other respect, any difference in outcomes must, therefore, be attributable to the treatment.
The beauty of this approach is that in one fell swoop, it cuts through questions of causality that haunt conventional statistical analysis of existing data. In conventional statistics, a correlation between any two variables, say a treatment and an outcome, does not necessarily represent a causal relationship; it may be confounded by unobserved or unobservable differences that create a spurious relationship.
For example, suppose a researcher is trying to determine if furnishing free school lunches improves student performance as measured by grades or any other outcome of choice.
A conventional statistical test using existing data that finds a positive correlation may be confounded by the fact that, for example, pupils who are more studious and better motivated might be more likely to opt for a free school lunch rather than skipping it to play with their friends in the school grounds.
The correlation could conceivably represent, then, what statisticians call selection bias, a close cousin of reverse causality: motivation and a disposition to earn higher grades drive the decision to opt for lunch, rather than lunch driving the grades.
An RCT, at least in theory, solves this problem, since, by construction, the treatment and control groups are otherwise identical on average—say in motivation and ability—and the only difference between them is that one group is offered a school lunch, while the other is not.
A positive correlation, in this case, could legitimately be interpreted as supporting a causal relationship. What is more, it could be used to argue in favour of a public policy intervention to offer school lunches on the public dime—as has, in fact, happened in various places.
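The contrast between the two approaches can be illustrated with a toy simulation. What follows is a minimal sketch, not a model of any real study: the function name `simulate`, the grade equation, the "true" lunch effect of 2 grade points and the assumption that motivated pupils are likelier to take lunch are all hypothetical choices made purely for illustration.

```python
import random
import statistics


def simulate(n=20000, true_effect=2.0, seed=0):
    """Compare a naive observational estimate with an RCT estimate.

    All numbers here are made up: grades depend on unobserved
    motivation, plus a fixed lunch effect of `true_effect` points.
    """
    rng = random.Random(seed)
    obs_lunch, obs_no_lunch = [], []
    rct_lunch, rct_no_lunch = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved by the researcher
        noise = rng.gauss(0, 1)

        # Existing data: motivated pupils are likelier to opt for lunch.
        takes_lunch = motivation + rng.gauss(0, 1) > 0
        grade = 60 + 5 * motivation + true_effect * takes_lunch + noise
        (obs_lunch if takes_lunch else obs_no_lunch).append(grade)

        # RCT: a coin flip decides who is offered lunch.
        assigned = rng.random() < 0.5
        grade = 60 + 5 * motivation + true_effect * assigned + noise
        (rct_lunch if assigned else rct_no_lunch).append(grade)

    obs_diff = statistics.mean(obs_lunch) - statistics.mean(obs_no_lunch)
    rct_diff = statistics.mean(rct_lunch) - statistics.mean(rct_no_lunch)
    return obs_diff, rct_diff


obs_diff, rct_diff = simulate()
print(f"observational difference in mean grades: {obs_diff:.2f}")
print(f"randomized difference in mean grades:    {rct_diff:.2f}")
```

Under these assumptions, the comparison based on existing data should come out well above the true effect, because it also picks up the motivation gap between the two groups, while the randomized comparison should land close to 2.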
Such claims then form the basis for the revolution argued for by Banerjee, Duflo and others. But how tenable are these claims? Not very, when they are probed. In a long review article, economist Pranab Bardhan offers a scathing critique of the Banerjee-Duflo approach, identifying five generic methodological issues that call into question the relevance of findings derived from RCTs as used in economics and public policy. Note that these objections apply to RCTs broadly, and not just the work of these particular economists.
First, despite the best of intentions, the assignment of individuals to treatment and control groups may not be truly random. This could be due to what Bardhan terms impurities in design, implementation and participation in the study.
Second, the questions that RCTs address are, by definition, limited in scope. One cannot construct an RCT to determine, for instance, if the central bank should raise interest rates by 50 or 100 basis points in the face of inflation rising a percentage point above its target. (One basis point is one-hundredth of a percentage point.)
That problem requires a conventional economic model and the application of conventional statistical techniques to existing data. Relatedly, a treatment may have economic, political or other spillover effects that go unseen and unmeasured at the micro level, but would have a significant, perhaps deleterious, impact, if an intervention is scaled up at the national or global level.
Third, by construction, RCTs can only demonstrate an average treatment effect, which may mask considerable variation among individual test subjects. For example, a school lunch might have had a negligible impact on one student’s performance and a stellar impact on another, with the average falling somewhere in between.
This makes it difficult to do a cost-benefit analysis on whether such an intervention would make sense. Further, it scants the distributional impacts of a policy change—which, as Bardhan correctly reminds us, is one of the evergreen questions of political economy.
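A small numerical sketch shows how much an average can hide. The two lists of individual effects below are invented for illustration, as is the helper `summarize`; note that in a real trial individual effects cannot be observed at all, since no pupil is seen both with and without lunch.

```python
import statistics

# Hypothetical individual treatment effects, in grade points.
uniform_effects = [2.0] * 10               # everyone gains a little
skewed_effects = [0.0] * 8 + [10.0] * 2    # two pupils gain a lot, eight gain nothing


def summarize(effects):
    """Return the average effect alongside what the average conceals."""
    return {
        "ate": statistics.mean(effects),
        "share_helped": sum(e > 0 for e in effects) / len(effects),
        "spread": statistics.pstdev(effects),
    }


print(summarize(uniform_effects))
print(summarize(skewed_effects))
```

Both populations have the same average treatment effect of 2 points, yet in the first every pupil benefits and in the second only one in five does. An experiment that reports only the average cannot tell these two worlds apart.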
Fourth, even if an RCT is able to establish causality, the experiment itself is, by construction, unable to determine the causal mechanism. Thus, the researcher must offer an account of the finding that may or may not be plausible and does not necessarily derive in any rigorous fashion from the experimental design itself.
It would be somewhat akin to discovering that a drug miraculously cures a treatment group but not understanding the biochemistry of why it works. Regulators would be wary in such a case of approving it for use, without understanding the mechanism through which it operates.
This is where economic theory—which elucidates causal mechanisms in a rigorously specified model—must, perforce, play a role, something that proponents of RCTs tend to discount. Policy analysis and advice cannot be theory-free.
Fifth, and this is the angle that the rest of this piece will explore in greater depth, there is the key issue of what is known in statistics as generalizability or external validity: in layman’s terms, the question of wider applicability.
In a nutshell, the question is whether the results of an RCT can be extended with confidence to places and times other than those in which the original experiment took place.
In a letter to the editor commenting on The Economist’s gushing leader, my brother Rajeev Dehejia, a public policy professor at New York University, and I pointed to this factor as, in our judgment, crucially limiting the utility of RCTs in public policy.
The reason that external validity is not such a problem in medicine, where RCTs originated, is that natural scientists have a relatively well-developed model of how the human body functions at the molecular level. Further, there is a fair degree of confidence that, at this molecular level, which is the level at which, for instance, pharmaceutical interventions operate, humans are essentially alike, or, at any rate, that the differences are small enough to be ignored.
By contrast, economists do not have an agreed upon theory of the sorts of social interactions that RCTs test, and there is good reason to believe that there might be unobserved and even unobservable differences between the original site of an experiment and possible future uses in other places and times.
Obvious sources of such possible differences are the overall economic environment, history, culture, religion and institutions, among a host of possibilities. This renders it treacherous to attempt to generalize the findings of an RCT, or even a small number of related RCTs, performed in specific places and at specific times. As far as public policy goes, this could be an insuperable difficulty.
Studying the problem of external validity in the context of RCTs and of natural experiments, which are closely related, is at the frontier of current scientific research in economics, political science and public policy. A recent blog post by Rajeev Dehejia, Columbia public policy professor Cristian Pop-Eleches and NYU political science professor Cyrus Samii draws on their ongoing research and highlights the potential difficulties.
Their first finding is that when the results of a natural experiment (or, by extension, an RCT) are extended from their original context to other times and places, the treatment effect may be smaller or larger in magnitude, may disappear altogether, or may even work in the opposite direction of what it did in the original study. This would, obviously, make it difficult, if not impossible, to draw sensible public policy implications of any general applicability.
Their second finding is that adding further control variables on top of those used in the original study will, generally, reduce prediction error and improve predictive power in settings other than the original one. As matches common sense, more data is, other things being equal, better than less data. But, and this is crucial, there is no automatic or mechanical sense in which more data necessarily improves predictive power in a linear fashion.
In some contexts, no matter how much more data is added, prediction error converges, but not necessarily to zero. In other words, it’s not always possible to drive prediction error down to zero by piling on more data, although more data, almost by definition, cannot hurt.
The nuanced bottom line finding from the Dehejia-Pop-Eleches-Samii study is that external validity is not a yes-no question. In other words, in some settings, the results of RCTs are generalizable, while in other settings, they are not. Context matters.
Other important and ongoing recent research confirms the larger message that external validity is not a trivial problem for RCTs. Such research also gives us good reason to temper our enthusiasm when it comes to imagining that RCTs may be a magic bullet in assessing and formulating public policy.
A 2013 research paper by economists Lant Pritchett of the Harvard Kennedy School and Justin Sandefur of the Center for Global Development, a think tank, argues that conventional statistical results from non-experimental studies in the setting in question are, at present, a better guide to likely impacts than experimental results derived from other settings.
In other words, despite their well-known limitations, conventional statistical methods in the right setting are better than theoretically superior RCT results from the wrong setting. This finding bears directly on the likely lack of external validity of RCTs and other natural experiments; it is just not that easy to transplant research results from one setting to another.
A 2014 research paper by NYU economics professor Hunt Allcott (an earlier version dates from 2012) explores the problem of site selection bias. This is a situation in which the likelihood that a particular programme is adopted or evaluated at a given site is correlated with the size of its effect there, which again casts doubt on the external validity of findings so contaminated.
For instance, a school district that signs up for testing of the efficacy of school lunches might be predisposed to providing them, and test subjects in the district might be primed to be more responsive to the treatment. In such cases, a strong and positive finding would tend to overstate the true treatment effect across a wider population because the sites that are selected for treatment are not truly random.
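The school district example can be sketched as a toy simulation. This is only an illustration under invented assumptions: the function `site_selection_gap`, the distribution of site-level effects and the rule by which districts volunteer are all hypothetical, and none of it is drawn from Allcott's data.

```python
import random
import statistics


def site_selection_gap(n_sites=5000, seed=1):
    """Compare the average effect across all sites with the average
    among sites that volunteer to be evaluated (made-up numbers)."""
    rng = random.Random(seed)
    all_effects, evaluated_effects = [], []
    for _ in range(n_sites):
        # Hypothetical true effect of school lunches in this district.
        effect = rng.gauss(1.0, 1.0)
        all_effects.append(effect)
        # Assumption: districts where lunches work better are likelier
        # to sign up for the evaluation.
        if effect + rng.gauss(0, 1) > 1.5:
            evaluated_effects.append(effect)
    return statistics.mean(all_effects), statistics.mean(evaluated_effects)


population_mean, evaluated_mean = site_selection_gap()
print(f"average effect, all districts:       {population_mean:.2f}")
print(f"average effect, evaluated districts: {evaluated_mean:.2f}")
```

Even if each trial within a volunteering district is perfectly randomized, the average across evaluated districts should overstate the average across all districts here, because the districts themselves were not chosen at random.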
A 2015 research publication by consumer science professor Jonathan Bauchet of Purdue, public policy professor Jonathan Morduch of NYU, and economist Shamika Ravi of Brookings India, a think tank, highlights yet another mechanism through which external validity is likely to fail.
While an RCT may elucidate a specific impact of a given treatment, that direct impact will interact with the overall economic environment, in particular, with economic opportunities outside the programme under study. This will mean that the net, rather than direct, impact of an RCT is likely to differ widely across time and space, again making external validity a tricky proposition.
Thus, for instance, a school lunch programme will likely have different effects, depending on whether cheap and nutritious food is readily available privately to students—say at nearby food stalls—or whether the school is isolated and a cafeteria lunch is the only viable option.
In an important and recent contribution to the debate, Stanford University economics professor Eva Vivalt attempts to assess whether impact evaluations based on RCTs or natural experiments are at all worthwhile. She does this by weighing the cost of such evaluations against the improvement in predictive power they offer over a naive or untutored prior guess by a policymaker.
She finds that if policymakers use a naive guess about a treatment effect, they would be off target by about 97%. Incredibly, performing an impact evaluation only improves the predictive accuracy by a meagre 0.1-0.3 percentage points. Since such evaluations are costly, it is not at all clear that they are worthwhile when they add so little to a policymaker’s best guess.
Despite what the reader may think at this juncture, the bottom line is not that RCTs are useless. To the contrary, as the research discussed above makes clear, we need more and better RCTs, to improve upon and generalize their findings and determine the range of their applicability, and to better integrate the analytical framework of RCTs with existing economic theory and empirical methods.
Equally, we must be wary of seeing RCTs as a social science and public policy revolution in the making, as some of their proponents appear to argue in their quite understandable zeal at having discovered an important new tool to add to our kit. A new tool is always welcome, but not at the expense of discarding well-established and well-understood classical tools of economic theory, statistical analysis and public policy evaluation.
Economics Express runs weekly, and features interesting reads from the world of economics and finance.