The randomized controlled trial has become the gold standard for evaluating development interventions—but that standard was forged in contexts where researchers can return to the same village next quarter, where enumerators travel freely, and where the population you randomized at baseline still lives where you found them. Fragile and conflict-affected states violate every one of these assumptions. Yet these are precisely the settings where development spending is most concentrated and where getting program design wrong carries the highest human cost.
The temptation is to abandon rigor altogether in these environments—to fall back on process monitoring, anecdotal evidence, and the argument that something is better than nothing. This is understandable but dangerous. Fragile states absorb roughly a third of official development assistance globally, and the interventions deployed there—cash transfers during displacement, livelihood programs in post-conflict zones, community-driven reconstruction—desperately need causal evidence on what works. Without it, we recycle failed approaches and waste resources that displaced and war-affected populations cannot afford to see squandered.
The real question is not whether to conduct rigorous evaluation in fragile contexts, but how to adapt the methodology so that rigor survives contact with reality. This requires rethinking randomization strategies, data collection infrastructure, ethical frameworks, and the very timeline on which we expect to generate evidence. The adaptations are not trivial—they demand genuine methodological innovation, not just logistical workarounds.
Security and Access Constraints: When the Field Becomes a Moving Target
Standard RCT design assumes that treatment and control units remain accessible throughout the study period. In fragile states, this assumption collapses. Entire treatment arms can become unreachable when conflict flares, roads are mined, or government forces restrict humanitarian access. A randomized evaluation of agricultural inputs in South Sudan, for example, may lose half its sample villages to displacement within months of baseline data collection. The attrition is not random—it is systematically correlated with the very outcomes you are trying to measure.
Enumerator safety introduces a constraint that no institutional review board protocol in a stable-country university fully anticipates. Data collection teams in fragile settings face kidnapping risk, military checkpoints, and suspicion from armed actors who may view survey activities as intelligence gathering. This means survey rounds get delayed, shortened, or rerouted, introducing measurement error that compounds across waves. Some researchers have shifted toward phone-based surveys and remote sensing data to reduce physical exposure, but these substitutes carry their own biases—phone ownership is non-random, and satellite imagery cannot capture subjective welfare measures.
Population displacement is perhaps the deepest methodological challenge. When individuals flee across borders or relocate to IDP camps, tracking them for endline measurement becomes extraordinarily difficult and expensive. Differential attrition between treatment and control groups can destroy the internal validity that randomization was supposed to guarantee. If treated households are more likely to remain in place because the intervention itself provides an anchor—say, a functioning health clinic—then comparing stayers to stayers across arms no longer isolates the treatment effect.
Adaptive randomization offers partial solutions. Cluster randomization at higher geographic units—districts rather than villages—can reduce the probability that entire treatment clusters become inaccessible simultaneously. Stepped-wedge designs, where treatment rolls out sequentially across clusters, allow researchers to exploit whatever access windows emerge rather than requiring simultaneous baseline measurement across all sites. Waitlist designs, already common in development RCTs, become even more valuable when the question is not just fairness but logistical feasibility.
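To make the logic concrete, here is a minimal sketch of a stepped-wedge rollout schedule in Python. Everything in it is illustrative: the district identifiers and the `stepped_wedge_schedule` helper are invented for the example, and in a real study the clusters would come from the sampling frame and the schedule would be pre-registered.

```python
import random

# Hypothetical district identifiers; in practice these would come from
# the sampling frame agreed with the implementing partner.
districts = [f"district_{i:02d}" for i in range(12)]

def stepped_wedge_schedule(clusters, n_steps, seed=42):
    """Randomly assign clusters to sequential rollout waves.

    Every cluster eventually receives treatment; randomization
    determines when. Clusters in later waves serve as concurrent
    controls for earlier waves, so measurement can exploit whatever
    access windows are open at each step.
    """
    rng = random.Random(seed)
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    wave_size = -(-len(shuffled) // n_steps)  # ceiling division
    return {
        wave: shuffled[wave * wave_size:(wave + 1) * wave_size]
        for wave in range(n_steps)
    }

for wave, assigned in stepped_wedge_schedule(districts, n_steps=4).items():
    print(f"wave {wave + 1}: {assigned}")
```

Randomizing the order rather than the treatment status is the design's core trade: it sacrifices a pure never-treated control group in exchange for a rollout that can pause and resume as access permits.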
The harder truth is that some degree of methodological compromise is unavoidable. The goal becomes transparent documentation of threats to validity rather than pretending they do not exist. Bounding exercises—calculating worst-case and best-case treatment effects under different attrition assumptions—give policymakers honest intervals rather than false precision. This is less satisfying than a clean point estimate, but far more useful than abandoning measurement altogether.
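A bounding exercise of this kind can be implemented in a few lines. The sketch below computes Manski-style worst-case bounds on an average treatment effect for a binary outcome: missing endline observations are imputed at the extremes of the outcome's support, so the resulting interval holds no matter why households dropped out. The simulated data and 20% attrition rate are illustrative only.

```python
import numpy as np

def manski_bounds(y_treat, y_ctrl, y_min=0.0, y_max=1.0):
    """Worst-case bounds on the average treatment effect when some
    outcomes are missing (np.nan) because households could not be
    tracked to endline. Missing values are filled at the extremes of
    the outcome's support, requiring no assumption about attrition."""
    def filled_mean(y, fill):
        return np.where(np.isnan(y), fill, y).mean()

    lower = filled_mean(y_treat, y_min) - filled_mean(y_ctrl, y_max)
    upper = filled_mean(y_treat, y_max) - filled_mean(y_ctrl, y_min)
    return lower, upper

# Illustration: simulated binary outcomes with roughly 20% attrition.
rng = np.random.default_rng(0)
y_t = rng.binomial(1, 0.6, 500).astype(float)
y_c = rng.binomial(1, 0.5, 500).astype(float)
y_t[rng.random(500) < 0.2] = np.nan
y_c[rng.random(500) < 0.2] = np.nan
print(manski_bounds(y_t, y_c))  # an honest interval, not a point estimate
```

The interval widens with attrition, which is exactly the point: the width itself communicates how much the missing households could matter.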
Takeaway: In fragile contexts, the choice is rarely between perfect evidence and imperfect evidence—it is between imperfect evidence and no evidence at all. The methodological discipline lies in being honest about what your design can and cannot identify.

Rapid Context Changes: Designing for Instability Rather Than Against It
A well-designed RCT depends on a stable counterfactual. You measure outcomes for the control group and infer what would have happened to the treatment group absent the intervention. But in fragile states, the counterfactual itself is in constant motion. A ceasefire collapses, a new armed group enters the region, commodity prices spike due to blockade, a cholera outbreak sweeps through displacement camps. These macro-shocks affect treatment and control groups differently depending on geography, ethnic composition, and political alignment—violating the exchangeability that randomization is supposed to ensure.
The standard response—controlling for observables at baseline—fails when the relevant variables are themselves unstable. Household wealth measured before a displacement event tells you almost nothing about household wealth six months later. Baseline balance tables become artifacts of a world that no longer exists. Researchers working in eastern DRC, northern Nigeria, or Yemen have learned this through painful experience: the careful stratification variables chosen at design stage may be irrelevant by midline.
Adaptive evaluation designs represent the most promising methodological response. Bayesian adaptive trials, borrowed from clinical medicine, allow researchers to update randomization probabilities and sample sizes as data accumulates. If early results suggest that one treatment arm is clearly dominant or that a subpopulation is being harmed, the design can adjust without waiting for the pre-specified endline. This is not the same as abandoning the protocol—it requires pre-registration of adaptation rules and stopping criteria to prevent fishing.
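As an illustration of the mechanics, the sketch below uses Thompson sampling with Beta-Binomial posteriors—one common form of Bayesian adaptive randomization—to update assignment probabilities between enrollment waves. The two arms, the binary outcome, and the response rates are invented for the example; a real trial would pre-register caps on assignment probabilities and formal stopping rules.

```python
import numpy as np

rng = np.random.default_rng(1)

# Beta(1, 1) priors over a binary outcome (e.g., household food
# security at follow-up) for two hypothetical treatment arms.
successes = np.array([0, 0])
failures = np.array([0, 0])

def thompson_assignment_probs(n_draws=10_000):
    """Posterior probability that each arm is best, used as the next
    wave's randomization probabilities (Thompson sampling)."""
    draws = rng.beta(1 + successes, 1 + failures, size=(n_draws, 2))
    best = draws.argmax(axis=1)
    return np.bincount(best, minlength=2) / n_draws

# Simulated enrollment in waves; true response rates are hypothetical.
true_p = np.array([0.45, 0.60])
for wave in range(5):
    probs = thompson_assignment_probs()
    arms = rng.choice(2, size=100, p=probs)
    outcomes = rng.random(100) < true_p[arms]
    for arm in (0, 1):
        successes[arm] += outcomes[arms == arm].sum()
        failures[arm] += (~outcomes[arms == arm]).sum()
    print(f"wave {wave + 1}: assignment probs = {probs.round(2)}")
```

Over successive waves the design shifts enrollment toward the better-performing arm, which is both its statistical appeal and its ethical appeal in settings where assigning people to an inferior arm has real costs.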
Another approach involves embedding evaluation within implementation cycles rather than treating it as a separate research activity. Rapid-cycle evaluation—running short, sequential experiments on discrete program components rather than one large trial on the full package—allows learning to accumulate even as the context shifts. If a cash transfer program operates in a corridor that becomes insecure, you can still evaluate the delivery mechanism, targeting accuracy, or household spending patterns in the areas that remain accessible.
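Structurally, a rapid-cycle setup can be as simple as a loop of small two-arm comparisons, one per program component per cycle, run in whichever sites remain accessible. The sketch below is a toy version with simulated outcomes; the component names, sample sizes, and effect sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical program components, each tested in a short cycle.
components = ["delivery_mechanism", "targeting_rule", "transfer_size"]

def quick_cycle(n=200, effect=0.1):
    """One rapid-cycle experiment: a small two-arm comparison
    reporting a difference in means and a conventional standard
    error, fast enough to finish inside a single access window."""
    treat = rng.random(n) < 0.5
    y = rng.normal(0.5 + effect * treat, 0.25)
    diff = y[treat].mean() - y[~treat].mean()
    se = np.sqrt(y[treat].var(ddof=1) / treat.sum()
                 + y[~treat].var(ddof=1) / (~treat).sum())
    return diff, se

for component in components:
    diff, se = quick_cycle()
    print(f"{component}: estimate {diff:+.3f} (SE {se:.3f})")
```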
The deeper lesson is philosophical as much as technical. Researchers trained in stable settings often treat context change as contamination—something that ruins the experiment. In fragile states, context change is the experiment. Understanding how interventions perform when conditions deteriorate is arguably more policy-relevant than understanding how they perform under ideal implementation conditions. The external validity question flips: we need to know not whether a program works in the best case, but whether it survives the worst case.
Takeaway: The most policy-relevant question in fragile settings is not whether an intervention works under stable conditions, but whether it retains its effectiveness when conditions deteriorate—which means designing evaluations that treat instability as a feature, not a contaminant.
Ethical Intensification: Do No Harm When the Stakes Are Existential
Every development RCT raises ethical questions about withholding treatment from control groups. In fragile states, these questions become orders of magnitude more urgent. When the population is facing active violence, acute malnutrition, or displacement, assigning households to a control group that receives no intervention is not an abstract philosophical exercise—it may have life-or-death consequences. The standard justification—that we do not yet know whether the intervention works, and therefore equipoise holds—becomes harder to sustain when the alternative to intervention is destitution.
The do-no-harm principle, already central to humanitarian ethics, acquires additional dimensions in research contexts. The act of data collection itself can cause harm. Surveying households about their displacement experiences, exposure to violence, or political affiliations can retraumatize respondents, expose them to suspicion from armed actors, or generate information that could be weaponized if data security is compromised. In contexts where ethnic or political identity is associated with targeting, even a household roster can become a dangerous document.
Informed consent—the bedrock of research ethics—faces structural challenges when populations are under duress. Can consent be truly voluntary when a displaced household depends on the implementing organization for food and shelter? The power asymmetry between researcher and subject is magnified enormously when subjects are in survival mode. Institutional review boards in high-income countries often lack the contextual expertise to evaluate these dynamics, leading to either overly restrictive protocols that prevent useful research or rubber-stamp approvals that underestimate real risks.
Some researchers have responded by adopting participatory and community-based ethical review processes that complement formal IRB oversight. Engaging local leaders, civil society organizations, and affected communities in study design—not as subjects but as partners in determining acceptable risk—produces protocols that are both more ethically grounded and more practically implementable. This is slower and more expensive than standard IRB review, but it surfaces risks that no desk-based protocol can anticipate.
There is also an ethical argument for conducting rigorous evaluation in fragile states that is too rarely made. The absence of evidence is not ethically neutral. When programs are implemented at scale without evaluation, resources are allocated based on assumption, institutional inertia, and political convenience. The populations harmed by ineffective programs—those who receive poorly designed interventions instead of something better—are invisible casualties of our failure to generate evidence. Ethical rigor demands not just protecting research participants from study-related harm, but protecting future beneficiaries from the harm of perpetual ignorance about what works.
Takeaway: The ethical case for rigorous evaluation in fragile states is not weaker than in stable contexts—it is stronger, precisely because the cost of deploying ineffective interventions is measured in lives, not just wasted dollars.
Conducting rigorous impact evaluation in fragile states demands methodological humility and innovation in equal measure. The tools of experimental economics—randomization, controlled comparison, statistical inference—do not become irrelevant in these settings, but they must be wielded with a clear-eyed understanding of their limitations and adapted to contexts that resist the assumptions underlying textbook designs.
The adaptations outlined here—adaptive randomization, rapid-cycle evaluation, transparent bounding of validity threats, participatory ethical review—are not concessions to imprecision. They are methodological advances driven by the hardest possible testing ground. Evidence generated in fragile contexts, precisely because it survives so many threats to validity, often has greater external relevance than pristine trials conducted under ideal conditions.
The alternative—reserving rigorous evidence for stable, accessible settings and flying blind in the places where development spending is most concentrated—is intellectually indefensible and ethically untenable. The discipline must go where the need is greatest, even when the methods strain under the weight.