Development economics has embraced randomized controlled trials as the gold standard for understanding what works. We've accumulated thousands of rigorous evaluations across dozens of countries. But here's the uncomfortable question that haunts every policymaker trying to scale a promising intervention: will this work here?

A deworming program that transformed school attendance in rural Kenya. A microcredit scheme that boosted incomes in Bangladesh. A conditional cash transfer that improved child nutrition in Mexico. Each backed by rigorous evidence. Each tempting to replicate elsewhere. But the leap from experimental context to new setting is treacherous terrain.

This is the external validity problem—the challenge of determining whether findings from one population, time, and place will hold in another. It's not merely an academic concern. Billions of development dollars flow toward interventions based on evidence generated elsewhere. Getting external validity wrong means wasting resources, missing opportunities, and sometimes causing harm. The good news: we're developing frameworks to navigate this challenge systematically rather than relying on intuition alone.

PICO Framework Adaptation: Systematic Assessment of Generalizability

Clinical medicine faced the external validity challenge decades before development economics. Its solution—the PICO framework—offers a structured approach we can adapt. PICO stands for Population, Intervention, Comparison, and Outcome. Each element requires explicit comparison between the study context and the target context.

Population assessment goes beyond demographics. Yes, age, gender, and income matter. But so do less visible characteristics: baseline health status, prior exposure to similar programs, social networks, trust in institutions, and existing knowledge. A financial literacy program tested on farmers with no prior banking experience will likely perform differently among urban workers with existing bank accounts.

Intervention analysis demands brutal honesty about implementation capacity. The program evaluated in the study may have been delivered by highly trained researchers with generous budgets. Will your government partners achieve the same fidelity? Can you source the same materials? Will the intervention maintain its essential features when adapted for local languages and customs?

Comparison conditions shape effect sizes dramatically. An intervention compared against nothing looks better than one compared against existing services. If the study context had no government programs while your target context has functioning alternatives, the marginal impact of your intervention shrinks. The relevant question isn't 'does this work?' but 'does this work better than what's already available?'

Outcome definitions require careful scrutiny. A program might improve test scores in one context but not another—not because it failed, but because the tests measure different skills. Labor market interventions evaluated by employment rates might show different results in contexts with varying definitions of 'employment' or different baseline rates. Mapping outcomes across contexts is harder than it appears.
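To make this four-way comparison less abstract, here is one way the assessment could be recorded as a structured checklist. This is a minimal sketch in Python; the field names and the deworming-flavored entries are hypothetical illustrations, not output from any standard PICO tool.

```python
from dataclasses import dataclass

@dataclass
class PICOElement:
    """One PICO dimension, compared across study and target contexts."""
    name: str                      # "Population", "Intervention", "Comparison", or "Outcome"
    study_context: str             # how this element looked in the original evaluation
    target_context: str            # how it looks where we want to replicate
    likely_to_alter_results: bool  # our judgment call, made explicit
    notes: str = ""

def flag_risks(elements: list[PICOElement]) -> list[PICOElement]:
    """Return the PICO dimensions where differences plausibly change effects."""
    return [e for e in elements if e.likely_to_alter_results]

# Hypothetical assessment for transplanting a school-based deworming program.
assessment = [
    PICOElement("Population", "rural children, high baseline worm load",
                "peri-urban children, lower infection prevalence", True,
                "lower prevalence mechanically caps attendance gains"),
    PICOElement("Intervention", "NGO delivery by trained field staff",
                "ministry delivery by stretched health staff", True,
                "implementation fidelity is uncertain"),
    PICOElement("Comparison", "no existing deworming services",
                "periodic national deworming campaigns", True,
                "marginal benefit is over existing services, not over nothing"),
    PICOElement("Outcome", "attendance from enrollment registers",
                "attendance from teacher self-reports", False,
                "measurement differs but captures a similar construct"),
]

for e in flag_risks(assessment):
    print(f"{e.name}: {e.notes}")
```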

Takeaway

External validity assessment isn't about asking whether contexts are 'similar enough'—it's about systematically comparing populations, interventions, comparisons, and outcomes to identify where differences might alter results.

Surface vs Deep Similarity: What Actually Determines Transferability

Two villages might look identical—same crops, same income levels, same distance from markets—yet respond completely differently to the same agricultural extension program. Meanwhile, an urban slum in Mumbai and a rural community in Guatemala might respond similarly to a sanitation intervention despite surface dissimilarity. The key is understanding mechanisms.

Surface similarity deceives. Two contexts might share demographic profiles but differ in ways that fundamentally alter how an intervention works. Social norms around gender might determine whether women can access a savings program. Trust in local leaders might determine whether communities adopt new health practices. Political structures might determine whether infrastructure investments are maintained. These factors rarely appear in study reports but often drive results.

Mechanism identification is the crucial skill. When you read an evaluation, ask: through what process did this intervention produce effects? A conditional cash transfer might work by relaxing budget constraints, or by signaling government attention, or by creating aspirational targets. The same measured outcome can flow from different mechanisms—and different mechanisms transfer differently.

Consider microcredit. Early evaluations in Bangladesh showed positive effects. Replications in other contexts often failed. Why? The original context featured strong social networks that enforced repayment through reputation effects. The intervention worked partly through social pressure mechanisms that didn't exist in more atomized urban settings. The surface features—loan size, interest rates, repayment schedules—were similar. The deep mechanism was absent.

Developing mechanism maps helps identify transfer risks. For any intervention, list the causal steps between program delivery and outcome. Then assess which steps depend on context-specific features. If a nutrition education program works because mothers already trust health workers, it may fail in contexts where that trust is absent—regardless of how similar the malnutrition rates appear.
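One minimal way to write such a map down, assuming a hypothetical nutrition education program like the one above, is as an ordered list of causal links, each tagged with the contextual features it depends on. Everything in this sketch (step names, dependency labels, the judgment about what holds in the target context) is illustrative.

```python
# A mechanism map: the causal chain from delivery to outcome, with the
# context-specific features each link depends on.
mechanism_map = [
    {"step": "health workers deliver nutrition sessions",
     "depends_on": ["health worker coverage"]},
    {"step": "mothers attend and absorb the messages",
     "depends_on": ["trust in health workers", "time availability"]},
    {"step": "feeding practices change at home",
     "depends_on": ["household decision-making power", "food affordability"]},
    {"step": "child nutrition improves",
     "depends_on": ["malnutrition driven by knowledge gaps, not food scarcity"]},
]

# Features we believe hold in the target context (hypothetical judgment calls).
present_in_target = {
    "health worker coverage",
    "time availability",
    "food affordability",
}

# Flag links whose contextual prerequisites are missing or unverified.
for link in mechanism_map:
    missing = [d for d in link["depends_on"] if d not in present_in_target]
    if missing:
        print(f"Transfer risk at '{link['step']}': unverified -> {missing}")
```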

Takeaway

Contexts that look alike often behave differently, while dissimilar contexts sometimes respond identically—the difference lies in whether the causal mechanisms driving results are present, not whether surface characteristics match.

Building Transportable Evidence: Design Strategies for Generalizability

We can't test every intervention in every possible context. But we can design evaluations that produce more transportable knowledge. This requires shifting from 'what works here' to 'under what conditions does this work'—a subtle but transformative reframing.

Mechanism-focused evaluation should become standard practice. Beyond measuring final outcomes, evaluations should measure intermediate steps in the causal chain. If a training program is supposed to work by improving knowledge, then changing behavior, then improving income, measure all three. This allows future users to assess which links might break in their context.
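As a sketch of what this looks like in practice, the snippet below estimates the treatment effect at every link of a hypothesized knowledge-to-behavior-to-income chain rather than only at the endpoint. The simulated data and column names (treated, knowledge, behavior, income) are assumptions for illustration, and statsmodels' formula API is just one convenient way to run the regressions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated trial data standing in for a real evaluation dataset.
rng = np.random.default_rng(0)
n = 1000
treated = rng.integers(0, 2, n)
knowledge = 0.5 * treated + rng.normal(size=n)
behavior = 0.4 * knowledge + rng.normal(size=n)
income = 0.3 * behavior + rng.normal(size=n)
df = pd.DataFrame({"treated": treated, "knowledge": knowledge,
                   "behavior": behavior, "income": income})

# Estimate the treatment effect on every step of the hypothesized chain,
# not just the final outcome, so future users can see where links might break.
for outcome in ["knowledge", "behavior", "income"]:
    fit = smf.ols(f"{outcome} ~ treated", data=df).fit()
    ci = fit.conf_int()
    print(f"{outcome}: effect = {fit.params['treated']:.3f} "
          f"(95% CI {ci.loc['treated', 0]:.3f} to {ci.loc['treated', 1]:.3f})")
```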

Heterogeneity analysis reveals the conditions under which effects vary. Rather than reporting average treatment effects, evaluations should systematically explore how effects differ by subgroup characteristics. Does the program work better for women than men? For younger participants? For those with higher baseline education? These modifiers often predict whether an intervention will transfer to contexts with different population compositions.
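A common implementation is to interact the treatment indicator with candidate modifiers in a single regression. The sketch below assumes hypothetical columns (treated, female, baseline_education, outcome) and made-up simulated effects; it shows one simple specification, not the only way to do heterogeneity analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for a real evaluation dataset.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "female": rng.integers(0, 2, n),
    "baseline_education": rng.normal(8, 3, n),  # years of schooling
})
# In this simulation, the true effect is larger for women and for the more educated.
df["outcome"] = (0.2 * df.treated
                 + 0.3 * df.treated * df.female
                 + 0.05 * df.treated * df.baseline_education
                 + rng.normal(size=n))

# Interaction terms estimate how the treatment effect shifts with each modifier.
model = smf.ols("outcome ~ treated * female + treated * baseline_education", data=df)
print(model.fit().summary().tables[1])
```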

Multi-site trials build in variation from the start. When the same intervention is tested simultaneously across diverse contexts using common protocols, we learn directly about generalizability rather than inferring it later. The BRAC Ultra-Poor Graduation Program evaluations across seven countries exemplified this approach—revealing both remarkable consistency in core effects and predictable variation in specific outcomes.
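When each site reports an effect estimate and a standard error under a common protocol, a random-effects pooling step can summarize both the average effect and how much it genuinely varies across contexts. The sketch below uses the standard DerSimonian-Laird estimator with made-up site numbers; these are purely illustrative and are not results from the graduation evaluations.

```python
import numpy as np

# Hypothetical site-level treatment effects and standard errors from a
# multi-site trial run under a common protocol (illustrative numbers only).
effects = np.array([0.25, 0.18, 0.30, 0.10, 0.22, 0.28])
ses = np.array([0.08, 0.07, 0.10, 0.09, 0.06, 0.11])

# DerSimonian-Laird random-effects pooling.
w = 1.0 / ses**2                          # fixed-effect weights
fixed = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
Q = np.sum(w * (effects - fixed) ** 2)    # heterogeneity statistic
k = len(effects)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
w_star = 1.0 / (ses**2 + tau2)            # random-effects weights
pooled = np.sum(w_star * effects) / np.sum(w_star)
pooled_se = np.sqrt(1.0 / np.sum(w_star))
i2 = max(0.0, (Q - (k - 1)) / Q) * 100    # share of variation due to between-site differences

print(f"pooled effect = {pooled:.3f} (SE {pooled_se:.3f}), "
      f"tau^2 = {tau2:.4f}, I^2 = {i2:.0f}%")
```

A low tau-squared and I-squared would suggest the core effect travels across sites; a high value signals exactly the kind of context dependence the rest of this piece is about.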

Finally, structured speculation should accompany every evaluation report. Researchers should explicitly state what features of their context they believe were essential to results, and what alternative contexts might or might not see similar effects. This disciplined conjecture, grounded in their deep contextual knowledge, provides valuable guidance that raw data cannot offer. We're not asking researchers to guarantee external validity—we're asking them to share their informed hypotheses about boundaries.

Takeaway

Transportable evidence doesn't emerge from better studies in single contexts—it requires deliberate design choices that prioritize mechanism identification, heterogeneity analysis, and structured speculation about generalizability conditions.

External validity will never be fully solved. Contexts are irreducibly unique in some respects, and we cannot run infinite trials. But we can move from naive optimism—assuming that evidence travels freely—to disciplined assessment of when and why it might not.

The frameworks here—systematic PICO comparison, mechanism mapping, and transportability-focused design—won't give you certainty. They'll give you informed uncertainty. You'll know what you don't know. You'll identify the key assumptions required for evidence to transfer. You'll design smarter pilots in new contexts.

The alternative is worse: billions spent on programs that worked somewhere but fail here, or paralysis in the face of imperfect evidence. We can do better than both. External validity assessment should become as standard as p-values and confidence intervals—a routine discipline rather than an afterthought.