Development economics has made extraordinary progress in establishing whether interventions work. Randomized controlled trials have moved us from ideology-driven programming to evidence-based policy. But there is a persistent gap in the evidence architecture that limits the practical utility of even the most rigorous impact evaluations: we frequently know that something worked without understanding why it worked, how it worked, or what it would take to make it work again somewhere else.
This gap is not a minor methodological quibble. It is a fundamental constraint on the scalability and transferability of development evidence. When a conditional cash transfer program reduces school dropout by 15 percent in rural Mexico, policymakers in Bangladesh need more than a treatment effect estimate. They need to understand which components of the program drove the result, how local implementation shaped outcomes, and what would happen if delivery conditions differed from the original trial context.
Process evaluation—the systematic investigation of implementation fidelity, delivery mechanisms, and causal pathways—is the methodological complement that transforms impact evidence from an interesting finding into actionable design guidance. Yet it remains a remarkably underinvested area of development research. The field has built sophisticated machinery for estimating average treatment effects while leaving the explanatory infrastructure largely unbuilt. This article argues that impact evaluations without accompanying process evaluation produce evidence of limited utility, and offers frameworks for integrating the two in ways that strengthen both.
Beyond Black Box Evaluation
The canonical impact evaluation answers a deceptively simple question: did the intervention change the outcome of interest? The RCT, when properly designed and implemented, provides a credible estimate of the average treatment effect. This is valuable. It disciplines the field against wishful thinking and forces programs to demonstrate measurable results. But it produces what methodologists call black box evaluation—evidence that something happened without illumination of the mechanism.
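To see how little the canonical estimate contains, a minimal sketch helps: the code below simulates a stylized trial and recovers the average treatment effect from a regression of the outcome on random assignment. The data, effect size, and variable names are invented purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stylized trial data, invented purely for illustration.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({"treated": rng.integers(0, 2, n)})
df["outcome"] = 0.15 * df["treated"] + rng.normal(size=n)

# The canonical impact estimate: regress the outcome on random assignment.
ate = smf.ols("outcome ~ treated", data=df).fit(cov_type="HC1")
print(ate.params["treated"], ate.bse["treated"])
```

Everything the estimate knows is in that one coefficient; nothing about delivery, dosage, or mechanism enters the calculation.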
Consider the practical limitations this creates. A microfinance program is evaluated and found to increase household consumption by 8 percent. The impact estimate is credible, the confidence intervals are tight, and the result is published in a top journal. But the estimate alone cannot tell you whether the effect came from the capital itself, from the discipline of repayment schedules, from peer group dynamics, from the financial literacy training bundled into the loan product, or from some interaction of all four. Each of these pathways implies a radically different program design for replication.
The problem compounds when results are null or negative. A null finding from a black box evaluation is nearly uninterpretable. Did the intervention fail because the theory of change was wrong? Because implementation was poor? Because the dosage was insufficient? Because contextual factors overwhelmed the treatment? Without process data, you cannot distinguish between a bad idea and a good idea badly executed—a distinction that matters enormously for program design.
This limitation becomes acute precisely when evidence is most needed: at the point of scaling. Scaling decisions require understanding which program components are essential and which are incidental, how sensitive outcomes are to implementation quality, and where adaptation is possible without compromising effectiveness. Treatment effect estimates alone provide almost no guidance on these questions. They tell you what happened in a specific context under specific implementation conditions, but they do not decompose the result into its constituent causes.
The field's heavy investment in internal validity—getting the causal estimate right—has not been matched by equivalent investment in explanatory validity—understanding why the estimate takes the value it does. This asymmetry produces a growing library of credible effect sizes with limited accompanying theory about mechanisms, an evidence base that is rigorous but often practically thin.
Takeaway: A treatment effect without a causal explanation is a fact without a theory—it tells you what happened once, but not what to expect when conditions change.
Process Evaluation Components
Process evaluation is not a single method but a structured investigation across three domains: implementation fidelity, delivery documentation, and mechanism investigation. Each addresses a distinct dimension of the explanatory gap, and each requires deliberate design and resourcing from the earliest stages of the evaluation.
Implementation fidelity measurement answers the question: was the program delivered as designed? This involves specifying the intended treatment in granular detail—dosage, frequency, content, provider qualifications, targeting criteria—and then measuring actual delivery against this specification. Fidelity data transforms impact evaluation from an assessment of the intended treatment to an assessment of the received treatment. In practice, the gap between these two can be enormous. Teacher training programs where teachers attend only half the sessions, health interventions where community health workers skip key counseling modules, agricultural extension services where field agents visit half as often as planned: fidelity failures are not the exception; they are the norm. Without measurement, they are invisible.
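As an illustration of what fidelity measurement can look like in practice, the sketch below scores hypothetical site-level delivery records against an intended specification. The dimensions, benchmarks, and the unweighted composite index are all assumptions for the example, not a standard from any particular trial.

```python
import pandas as pd

# Hypothetical monitoring data: one row per site, delivered values alongside the
# intended specification. Column names and benchmarks are illustrative only.
delivery = pd.DataFrame({
    "site_id": [1, 2, 3, 4],
    "sessions_delivered": [12, 6, 10, 3],      # intended: 12 training sessions
    "modules_covered": [8, 8, 5, 4],           # intended: 8 counseling modules
    "visits_per_month": [2.0, 1.0, 1.5, 0.5],  # intended: 2 field visits per month
})

intended = {"sessions_delivered": 12, "modules_covered": 8, "visits_per_month": 2.0}

# Fidelity on each dimension = delivered / intended, capped at 1.
for col, target in intended.items():
    delivery[f"fidelity_{col}"] = (delivery[col] / target).clip(upper=1.0)

# A simple composite: the unweighted mean across dimensions.
fidelity_cols = [f"fidelity_{c}" for c in intended]
delivery["fidelity_index"] = delivery[fidelity_cols].mean(axis=1)
print(delivery[["site_id", "fidelity_index"]])
```

A composite built this way turns "was it delivered as designed" into a site-level variable that can later be set against outcomes; in a real evaluation, the dimensions and any weights would come from the pre-specified treatment protocol rather than the analyst's judgment after the fact.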
Delivery documentation captures the operational reality of implementation: supply chain performance, staff turnover, administrative bottlenecks, community responses, political dynamics, and the countless adaptations that frontline workers make in response to local conditions. This qualitative and operational data provides the contextual narrative that makes quantitative findings interpretable. It also generates the practical knowledge that implementing organizations actually need—not whether a program works in theory, but what it takes to make it work in practice.
Mechanism investigation is the most analytically demanding component. It involves identifying the hypothesized causal pathways through which the intervention is expected to produce outcomes, specifying observable implications of each pathway, and collecting data that can distinguish between competing explanations. This goes beyond simple mediation analysis. It requires pre-specifying the theory of change in sufficient detail to be testable, measuring intermediate outcomes along the causal chain, and using both quantitative and qualitative methods to triangulate evidence about which pathways are active.
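The quantitative backbone of mechanism investigation often starts from something like the simple mediation decomposition sketched below, run here on simulated data with an invented intermediate outcome M. The caution above applies: the conditional-on-M regression is only credible under strong no-confounding assumptions, which is precisely why triangulation with qualitative evidence and pre-specified intermediate outcomes matters.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration only: a treatment T, a hypothesized intermediate
# outcome M (say, a financial literacy score), and a final outcome Y (consumption).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"T": rng.integers(0, 2, n)})
df["M"] = 0.5 * df["T"] + rng.normal(size=n)                   # pathway: T raises M
df["Y"] = 0.3 * df["M"] + 0.1 * df["T"] + rng.normal(size=n)   # M carries most of the effect

# Step 1: total effect of T on Y.
total = smf.ols("Y ~ T", data=df).fit()
# Step 2: does T move the intermediate outcome?
first_stage = smf.ols("M ~ T", data=df).fit()
# Step 3: effect of T on Y conditional on M (the "direct" path in a mediation framing).
direct = smf.ols("Y ~ T + M", data=df).fit()

print("Total effect of T: ", round(total.params["T"], 3))
print("Effect of T on M:  ", round(first_stage.params["T"], 3))
print("Direct effect of T:", round(direct.params["T"], 3))
```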
Together, these three components transform impact evaluation from a verdict into a diagnosis. Fidelity data tells you whether the treatment was administered correctly. Delivery documentation tells you what it took to administer it. Mechanism investigation tells you how it produced its effects. None of this undermines the impact estimate—it enriches it, contextualizes it, and makes it useful for the decisions that actually matter: whether to replicate, how to adapt, and where to invest next.
Takeaway: Process evaluation has three irreducible components—fidelity, delivery, and mechanism—and each answers a question that impact estimates alone cannot touch.
Integration with Impact Evaluation
The most common mistake in process evaluation is treating it as an afterthought—something bolted onto an impact evaluation after the trial is designed, funded, and underway. Effective process evaluation must be integrated from inception, embedded in the theory of change, the evaluation design, the data collection instruments, and the budget. Retrofitting it produces thin, unsystematic data that cannot bear analytical weight.
Integration begins with the theory of change. Before randomization, the evaluation team should articulate the causal chain in testable detail: what inputs are delivered, what behaviors or conditions they are expected to change, through what mechanisms, and under what assumptions. Each link in this chain becomes a measurement point. This exercise disciplines the impact evaluation itself—it forces clarity about what the treatment actually is, which is frequently more ambiguous than evaluators acknowledge.
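One lightweight way to force this articulation is to write the chain down as structured data, with each link carrying its assumption and the indicators that will be measured at that point. The example below is a hypothetical chain for a cash-transfer-for-schooling program; the class, link names, and indicators are invented for illustration rather than drawn from any template.

```python
from dataclasses import dataclass, field

@dataclass
class CausalLink:
    """One link in a pre-specified theory of change (illustrative structure only)."""
    name: str
    assumption: str
    indicators: list[str] = field(default_factory=list)  # what gets measured here

# A hypothetical chain for a cash-transfer-for-schooling program.
theory_of_change = [
    CausalLink("Transfer delivered to household",
               "Payment system reaches intended recipients",
               ["share of transfers received on time", "amount received vs. entitled"]),
    CausalLink("Household relaxes cash constraint on schooling",
               "Transfer is not fully absorbed by other needs",
               ["school fees paid", "school-related spending"]),
    CausalLink("Child attends school more regularly",
               "Attendance, not learning quality, is the binding margin",
               ["attendance rate", "dropout status"]),
]

for link in theory_of_change:
    print(f"{link.name}: measure {', '.join(link.indicators)}")
```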
The sampling and data collection strategy must accommodate process evaluation needs. This typically means embedding qualitative data collection—interviews, observation, focus groups—within the trial structure, collecting administrative and operational data from implementing partners on a continuous basis, and measuring intermediate outcomes that lie along the causal chain. The marginal cost of adding these data streams to an existing evaluation is modest relative to the total evaluation budget. The marginal information value is enormous.
A practical framework for integration involves three phases. Pre-implementation: specify the theory of change, define fidelity benchmarks, and design data collection protocols for process indicators. During implementation: collect real-time fidelity and delivery data, conduct ongoing qualitative investigation, and use process findings to document—not correct—implementation variation. Post-implementation: analyze process data alongside impact estimates, use fidelity variation to examine dose-response relationships, and test hypothesized mechanisms against observed patterns in intermediate outcomes.
The critical discipline here is that process data should document variation, not eliminate it. The temptation is to use process monitoring to improve implementation during the trial—to ensure fidelity, correct deviations, and optimize delivery. This is valuable for program management but destructive for evaluation. Natural variation in implementation quality and fidelity is analytically precious: it allows you to examine whether outcomes track implementation intensity, which components appear to drive effects, and how sensitive results are to delivery conditions. Protect this variation. It is the raw material from which explanatory evidence is built.
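On simulated data, a dose-response check of this kind can be as simple as interacting assignment with a fidelity index, as sketched below. Because fidelity is not randomly assigned, the interaction term is descriptive evidence about implementation sensitivity rather than a causal estimate, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration; variable names are hypothetical.
rng = np.random.default_rng(1)
n = 1500
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "fidelity_index": rng.uniform(0.3, 1.0, n),
})
df["outcome"] = 1.0 * df["treated"] * df["fidelity_index"] + rng.normal(size=n)

# Does the estimated effect rise with implementation intensity?
# Fidelity is not randomly assigned, so treat this as exploratory, not causal.
# (In practice fidelity is usually observed only in treated sites; a within-treated
# regression of outcome on fidelity_index is the common fallback.)
model = smf.ols("outcome ~ treated * fidelity_index", data=df).fit(cov_type="HC1")
print(model.summary().tables[1])
```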
Takeaway: Process evaluation is not an add-on to impact evaluation—it is a co-designed complement that must be built into the theory of change, the data architecture, and the budget from day one.
The development evaluation field has achieved something remarkable: a credible, growing body of evidence about what works. But credible estimates without explanatory depth produce an evidence base that is paradoxically rigorous and incomplete—answering whether interventions work while remaining largely silent on why, how, and under what conditions.
Process evaluation is the methodological investment that closes this gap. It transforms treatment effects from isolated findings into transferable knowledge. It provides the implementation intelligence that policymakers and practitioners need to adapt, replicate, and scale. And it does so at marginal costs that are small relative to the analytical returns.
The standard should shift. An impact evaluation without process evaluation should be viewed not as complete but as partial—a necessary but insufficient contribution to the evidence base. Funders, journals, and evaluation designers all have roles in establishing this norm. The question is not whether we can afford to invest in process evaluation. It is whether we can afford the evidence waste that results from leaving it out.