Impact evaluation has become the gold standard in development. Randomized controlled trials, difference-in-differences, regression discontinuity—these methods have reshaped how we think about what works. And for good reason. They brought rigour to a field once dominated by anecdote and assumption.

But somewhere along the way, a subtle overreach occurred. The tools designed to answer specific questions started being treated as though they could answer all questions. Funders began demanding impact evaluations for programs where the method didn't fit. Policymakers waited for RCT evidence before acting on problems where waiting itself was the failure.

The result is a paradox: the very success of impact evaluation has created blind spots. Being clear about what these methods can and cannot tell us isn't a weakness—it's a prerequisite for using evidence well. The most rigorous tool in our kit is only as good as our judgment about when to reach for it.

What We Can Learn

Impact evaluations are designed to answer a deceptively simple question: did this specific intervention cause a measurable change in a defined outcome? That's it. Not whether the change was worth the cost. Not whether the program should be scaled. Not whether it was the best use of resources. Just: did it work, compared to the counterfactual?
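For readers who want the formal version, the standard potential-outcomes framing (my addition here, not notation the piece otherwise uses) makes the counterfactual explicit:

```latex
% Average treatment effect: the gap between the outcome with the
% intervention, Y(1), and the outcome without it, Y(0), averaged
% over the population of interest.
\mathrm{ATE} = \mathbb{E}\left[\, Y(1) - Y(0) \,\right]
```

No individual is ever observed in both states, which is why randomisation matters: it makes the average outcome of the control group a credible stand-in for the missing counterfactual.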

When the question is well-matched to the method, the results can be powerful. Edward Miguel and Michael Kremer's work on deworming in Kenya, for instance, established a causal link between treatment and school attendance that reshaped global health priorities. The strength of the finding came from the method's narrow focus—one intervention, one outcome, one context, carefully measured.

RCTs and quasi-experimental designs are particularly effective for testing discrete, standardised interventions with clearly measurable short-to-medium-term outcomes. Distributing bed nets, providing conditional cash transfers, testing different fertiliser subsidies—these are the sweet spot. The intervention can be randomised, the comparison group is meaningful, and the outcome can be observed within a study timeline.

This precision is the method's greatest asset. But precision about a narrow question is different from comprehensive understanding. Knowing that a program increased test scores by 0.2 standard deviations tells you something real. It doesn't tell you whether those score gains translated into better livelihoods, whether the program would work in a different district, or whether the gains persisted after the researchers left.
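To make figures like "0.2 standard deviations" concrete, here is a minimal sketch, using synthetic data and hypothetical names of my own (not drawn from any study cited here), of how an RCT estimate and its standardised effect size are typically computed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic test scores for a hypothetical RCT: 500 pupils randomly
# assigned to treatment, 500 to control. The "true" effect built into
# the simulation is 2 points on a test with a standard deviation of 10.
control = rng.normal(loc=50.0, scale=10.0, size=500)
treatment = rng.normal(loc=52.0, scale=10.0, size=500)

# Under random assignment, the difference in group means is an
# unbiased estimate of the average treatment effect.
ate = treatment.mean() - control.mean()

# Standardised effect size (Cohen's d): the raw effect divided by the
# pooled standard deviation, so results compare across different tests.
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
effect_size = ate / pooled_sd

print(f"ATE: {ate:.2f} points, effect size: {effect_size:.2f} SD")
# Prints an effect size near 0.2 SD: a real, well-defined quantity.
```

The point of the exercise is that the number is real and well-defined, yet nothing in the computation speaks to persistence, mechanism, or transferability.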

Takeaway

Impact evaluations answer whether a specific thing worked in a specific place. Treating that answer as a universal verdict—rather than a single data point in a much larger puzzle—is where evidence-based practice goes wrong.

What Remains Unknown

The list of things impact evaluations struggle with is long and consequential. Start with complex, system-level change. Governance reform, institutional capacity building, shifting social norms—these don't lend themselves to randomisation. You can't randomly assign countries to adopt decentralisation policies. You can't create a control group for a national anti-corruption campaign.

Then there's the problem of external validity—the question of whether results from one context apply elsewhere. An education intervention that works in rural India may fail in urban Nigeria. The mechanisms that drove success might depend on local teacher culture, parental expectations, or infrastructure that the evaluation never measured. A 2017 review in the Journal of Development Economics found that effect sizes for similar interventions varied enormously across sites, sometimes even changing direction.

Impact evaluations also struggle with long time horizons. Most studies track outcomes for one to three years. But development is a generational process. Did that early childhood nutrition program improve adult earnings twenty years later? Did that community-driven development project change local governance culture? These questions matter enormously, but they're expensive and logistically brutal to study with experimental methods.

Perhaps most critically, impact evaluations cannot tell us why something worked or failed. They measure the net effect but often leave the causal mechanism—the story of how change happened—as a black box. Without understanding mechanism, replication becomes guesswork. You end up copying the visible features of a successful program while missing the invisible conditions that made it succeed.

Takeaway

The most important development questions—about systems, mechanisms, long-term change, and transferability—are precisely the ones that impact evaluations are least equipped to answer. Recognising this gap is not anti-evidence; it's pro-honesty.

Complementary Approaches

If impact evaluation is a spotlight—brilliant but narrow—then effective development research requires floodlights too. Process evaluation and qualitative research can open the black box of how programs work. Ethnographic observation, in-depth interviews, and case studies reveal the messy human dynamics that quantitative data flattens. They tell you that the cash transfer program succeeded partly because the village chief championed it, or that the health intervention failed because clinic staff resented the extra workload.

Contribution analysis and theory-based evaluation offer frameworks for assessing complex programs where randomisation is impossible. Rather than asking "did this cause that?" they ask "given everything we can observe, is it plausible that this program contributed to the change we see?" It's less definitive, but for systemic interventions—governance reform, market systems development, institutional strengthening—it's often the only honest approach available.

Comparative case studies and historical analysis help with the external validity problem. By studying how similar interventions played out across multiple contexts, researchers can identify the conditions under which programs succeed or fail. This doesn't produce the clean causal estimates of an RCT, but it produces something arguably more useful for decision-makers: a map of when and where an approach is likely to work.

The real sophistication lies in mixing methods deliberately. Use an RCT to establish whether an intervention has an effect. Use qualitative research to understand why. Use comparative analysis to judge whether the findings travel. Use cost-effectiveness analysis to decide whether it's worth doing. No single method carries the full burden of proof—and no single method should.
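As an illustration of that last step (with entirely hypothetical programs and numbers, chosen only to show the arithmetic), cost-effectiveness analysis puts dissimilar interventions on a common scale by dividing cost by effect:

```python
# Hypothetical programs: cost per pupil (USD) and estimated effect on
# test scores (standard deviations), e.g. taken from separate RCTs.
programs = {
    "remedial_tutoring": {"cost_per_pupil": 15.0, "effect_sd": 0.25},
    "textbook_grants":   {"cost_per_pupil": 4.0,  "effect_sd": 0.03},
    "teacher_training":  {"cost_per_pupil": 30.0, "effect_sd": 0.20},
}

# Cost per 0.1 SD of test-score gain per pupil; lower means more
# learning for each dollar spent.
for name, p in programs.items():
    cost_per_tenth_sd = p["cost_per_pupil"] / (p["effect_sd"] / 0.1)
    print(f"{name}: ${cost_per_tenth_sd:.2f} per 0.1 SD gained")
```

Even this simple ranking leans on every caveat above: the effect estimates must be credible, and they must travel to the context where the money will actually be spent.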

Takeaway

The strongest evidence base isn't built from one method applied everywhere—it's built from multiple methods applied thoughtfully. The question isn't "do we have an RCT?" but "do we have the right kind of evidence for the decision we need to make?"

Impact evaluation earned its authority by bringing discipline to development. That authority is worth protecting—which means being honest about its boundaries. Overselling the method ultimately undermines it.

The best development practitioners don't stop at asking whether a program has been rigorously evaluated. They ask what kind of evidence exists, what questions it answers, and what remains uncertain. That's a harder discipline than demanding an RCT for everything, but it's closer to how good decisions actually get made.

Evidence-based development isn't about having the perfect study. It's about assembling the best available knowledge—quantitative and qualitative, experimental and observational—and making honest judgments under uncertainty. That's not a limitation. That's the work.