Development evaluation has a measurement problem, and it's not the one you'd expect. The challenge isn't that we can't measure things—it's that we've become exceptionally good at measuring the wrong things. We build elaborate data collection systems, design sophisticated survey instruments, and deploy rigorous experimental methods, only to train all of that firepower on outcomes that tell us remarkably little about whether anyone's life actually improved.
Consider a familiar pattern. A program increases school enrollment by fifteen percentage points. Evaluators declare success. Policymakers scale the intervention. Yet five years later, labor market outcomes for treated populations look no different from controls. The enrollment gains were real. The learning gains were not. And the welfare improvements we implicitly assumed would follow from enrollment never materialized. The proxy ate the outcome.
This isn't a niche methodological concern—it's a first-order problem that distorts how billions of development dollars get allocated. When we select outcomes because they're easy to measure, likely to show effects, or conventionally expected, we create a systematic wedge between what evaluations tell us and what programs actually accomplish. The result is an evidence base that looks rigorous on its face but may be quietly misleading us about which interventions genuinely improve human welfare. Getting outcome selection right isn't a technical footnote. It's the foundation on which everything else in impact evaluation rests.
The Proxy Trap: When Measured Outcomes Diverge from Welfare
Development evaluation relies heavily on proxy outcomes—measurable indicators assumed to correlate with the welfare improvements we ultimately care about. School enrollment stands in for learning. Clinic visits stand in for health. Microfinance uptake stands in for poverty reduction. The logic seems straightforward: if we move the proxy, the underlying welfare should follow. But the empirical record is littered with cases where this assumption catastrophically fails.
The education sector offers perhaps the starkest illustrations. Conditional cash transfer programs across Latin America and Sub-Saharan Africa have reliably increased enrollment and attendance. Yet rigorous assessments of learning outcomes—actual test performance, literacy rates, cognitive development—frequently show minimal or zero effects. Children are in school more often, but they aren't learning more. The proxy moved; welfare didn't. Similar disconnects appear in health programming, where increased facility utilization doesn't necessarily translate to improved health status when service quality remains poor.
The mechanism behind this divergence is important to understand. Proxies and welfare outcomes decouple when the production function between them is nonlinear, conditional on complementary inputs, or subject to threshold effects. Enrollment produces learning only when combined with adequate teaching quality, materials, and classroom conditions. Without those complements, the proxy becomes what economists call a sufficient statistic for nothing—it responds to the intervention but carries no information about the outcome we actually value.
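The complementarity argument can be made concrete with a toy model. In this sketch, learning is the product of enrollment and teaching quality, so the proxy (enrollment) can move while the outcome (learning) barely does. All functional forms and numbers here are illustrative, not estimates from any study.

```python
# Toy production function: learning requires both enrollment and
# teaching quality as complementary inputs. Numbers are illustrative.

def learning(enrollment: float, quality: float) -> float:
    """Learning output under a multiplicative (complementary) technology."""
    return enrollment * quality

# An intervention raises enrollment from 0.60 to 0.75 (15 percentage points).
baseline, treated = 0.60, 0.75

# With adequate teaching quality, the enrollment gain produces learning.
gain_high_quality = learning(treated, quality=0.80) - learning(baseline, quality=0.80)

# With near-zero quality, the proxy moves but learning barely responds.
gain_low_quality = learning(treated, quality=0.05) - learning(baseline, quality=0.05)

print(round(gain_high_quality, 3))  # same enrollment gain, very different
print(round(gain_low_quality, 3))   # learning gain
```

The same 15-point enrollment gain yields a learning gain sixteen times larger when the complementary input is present, which is exactly why the proxy alone carries no information about welfare.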
This problem is compounded by what we might call proxy lock-in. Once an indicator becomes the standard metric for a sector—enrollment for education, bed net ownership for malaria—institutional incentives align around moving that number. Program designers optimize for it. Funders evaluate against it. The indicator takes on a life of its own, independent of its relationship to welfare. Goodhart's Law operates with particular force in development: when a measure becomes a target, it ceases to be a good measure.
The practical consequence is that our evidence base systematically overstates the welfare impact of interventions that are good at moving proxies and understates the impact of interventions that affect welfare through channels proxies don't capture. A program that modestly improves consumption but doesn't affect enrollment will look like a failure by standard metrics, while a program that dramatically increases enrollment without affecting learning or earnings will look like a success. The measurement tail wags the policy dog.
Takeaway: A proxy outcome is only as useful as its demonstrated causal relationship to the welfare gain you actually care about. If you haven't validated that link empirically, your evaluation may be measuring movement, not progress.
Outcome Shopping: How Researcher Discretion Distorts the Evidence Base
Beyond the proxy problem lies a subtler but equally consequential issue: outcome selection bias at the researcher level. In a typical development RCT, data collection instruments capture dozens, sometimes hundreds, of variables. The decision about which of these to designate as primary outcomes—and which to relegate to secondary or exploratory status—often happens not at the design stage but during analysis, after researchers have seen the data. This discretion creates enormous scope for presenting favorable results.
The statistics are sobering. Studies examining pre-analysis plans against published results in development economics find significant divergence between pre-specified primary outcomes and those ultimately emphasized in papers. Outcomes that show statistically significant effects get promoted; null results get buried in appendices or dropped entirely. This isn't necessarily deliberate fraud—it often reflects genuine uncertainty about which outcomes matter most, resolved post hoc in a direction that happens to favor publication. But the aggregate effect on the evidence base is corrosive.
The problem intensifies with multiple hypothesis testing. With twenty outcome variables and a five percent significance threshold, you expect one significant result by chance alone. Without rigorous correction for multiple comparisons—and many published studies apply corrections inconsistently or not at all—the literature accumulates false positives that look like genuine program effects. A deworming program might show no effect on school attendance, health status, or nutritional outcomes, but show a significant effect on one subscale of a cognitive assessment. That subscale becomes the headline result.
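The arithmetic behind the twenty-outcome example is worth spelling out. With independent tests, the chance of at least one spurious "significant" result follows directly from the significance threshold, and a Bonferroni correction (one standard remedy) restores the family-wise error rate. The numbers below are the illustrative ones from the text, not drawn from any particular study.

```python
# Family-wise error under multiple testing: with k independent null
# outcomes each tested at level alpha, the probability of at least one
# false positive is 1 - (1 - alpha)**k.

alpha, k = 0.05, 20

expected_false_positives = alpha * k           # expectation: 1.0
p_at_least_one = 1 - (1 - alpha) ** k          # roughly 0.64

# Bonferroni correction: test each outcome at alpha / k instead,
# which caps the family-wise error rate at (approximately) alpha.
bonferroni_level = alpha / k                   # 0.0025
p_any_corrected = 1 - (1 - bonferroni_level) ** k

print(expected_false_positives)        # 1.0
print(round(p_at_least_one, 2))        # 0.64
print(round(p_any_corrected, 3))       # 0.049
```

In other words, an uncorrected twenty-outcome analysis has nearly a two-in-three chance of producing a headline-ready "effect" from pure noise.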
Pre-analysis plans represent the field's primary defense against this problem, and their adoption in development economics has been a genuine methodological advance. But they're not a complete solution. Plans filed before data collection can still specify an unreasonably large number of primary outcomes, diluting the discipline they're meant to impose. More fundamentally, pre-analysis plans address researcher discretion without addressing the deeper question of whether the right outcomes were selected in the first place. A perfectly pre-specified evaluation of the wrong outcomes is still measuring the wrong thing.
The institutional incentives reinforce the problem. Journals reward novel, significant findings. Funders want evidence their programs work. Implementing organizations need positive evaluations to maintain donor support. At every stage, the incentive structure pushes toward selecting and emphasizing outcomes where effects appear, rather than outcomes where effects matter. Addressing outcome selection bias requires not just methodological tools like pre-registration and multiple testing corrections, but a cultural shift in how we define evaluative success—away from statistical significance and toward demonstrated welfare relevance.
Takeaway: The freedom to choose which outcomes to emphasize after seeing results transforms rigorous experiments into something closer to data mining. Pre-registration helps, but only if the pre-specified outcomes are the ones that actually matter for welfare.
Toward Meaningful Measurement: Outcomes That Track Welfare
If standard proxies mislead and researcher discretion distorts, what should evaluators actually measure? The answer requires returning to first principles about what development interventions are ultimately trying to achieve. Most programs aim, directly or indirectly, at improving welfare—the material conditions, capabilities, and experienced quality of life of target populations. Outcomes should be selected for their demonstrated proximity to welfare, not their convenience or conventional status.
Consumption and expenditure remain among the most reliable welfare indicators in development contexts, precisely because they aggregate across the many channels through which a program might affect a household. A cash transfer program that increases consumption tells you something definitive about welfare, regardless of whether the mechanism runs through better nutrition, reduced stress, improved housing, or some combination. Asset accumulation—livestock, durable goods, savings—provides complementary evidence on whether welfare gains are durable rather than transitory. These measures are harder and more expensive to collect than enrollment figures or clinic visit counts, but they're measuring something real.
Subjective wellbeing measures have gained traction in development evaluation, and for good reason. Life satisfaction, experienced affect, and self-reported economic status capture dimensions of welfare that consumption data misses—social standing, psychological security, agency, hope. Critics rightly note measurement challenges: reference point effects, adaptation, cultural variation in reporting norms. But these concerns apply to any outcome measure. The question isn't whether subjective wellbeing is a perfect welfare indicator—nothing is—but whether it adds information beyond what consumption and asset data provide. The evidence suggests it does, particularly for interventions operating through psychological or social channels.
The most robust evaluation designs employ families of outcomes organized by their theoretical proximity to welfare. Primary outcomes should be direct welfare measures—consumption, health status, subjective wellbeing. Secondary outcomes include intermediate indicators that have well-established empirical links to welfare in the specific context. Exploratory outcomes can include proxies and process measures useful for understanding mechanisms, but these should never be confused with evidence of impact. This hierarchy, specified before data collection, provides both discipline and transparency.
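One way to make this hierarchy concrete in a pre-analysis plan is as an explicit, machine-readable registry filed before data collection. The outcome names and tiers below are illustrative placeholders, not a field standard.

```python
# Sketch of a pre-specified outcome hierarchy for a pre-analysis plan.
# Tier names follow the text; outcome names are hypothetical examples.

OUTCOME_HIERARCHY = {
    "primary": [        # direct welfare measures; confirmatory tests
        "household_consumption_per_capita",
        "health_status_index",
        "life_satisfaction_score",
    ],
    "secondary": [      # intermediate indicators with validated welfare links
        "asset_index",
        "food_security_score",
    ],
    "exploratory": [    # proxies and process measures; mechanisms only
        "school_enrollment",
        "clinic_visits",
    ],
}

def is_confirmatory(outcome: str) -> bool:
    """Only pre-specified primary outcomes count as evidence of impact."""
    return outcome in OUTCOME_HIERARCHY["primary"]

print(is_confirmatory("household_consumption_per_capita"))  # True
print(is_confirmatory("school_enrollment"))                 # False
```

Filing the tiers explicitly removes the post hoc discretion discussed earlier: an enrollment effect can inform a mechanism story, but it cannot be promoted into the headline result.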
Practical implementation demands trade-offs. Welfare outcomes require longer follow-up periods—you can measure enrollment immediately, but consumption effects may take years to materialize. They demand larger sample sizes, since effect sizes on ultimate outcomes are typically smaller than on proxies. They cost more. But these are arguments for better-resourced evaluation, not for settling for the wrong outcomes. The development sector spends billions on programs and relatively little on rigorously evaluating whether those programs improve lives. Investing more in measuring what actually matters isn't a luxury—it's the only way to ensure the rest of that spending isn't wasted.
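The sample-size cost of targeting welfare outcomes can be sketched with the standard power approximation for a two-arm comparison of means: required n per arm scales with the inverse square of the standardized effect size, so halving the detectable effect quadruples the sample. The effect sizes below are illustrative assumptions, not estimates.

```python
import math

# Approximate per-arm sample size for a two-sample comparison of means:
# n ~ 2 * ((z_alpha + z_beta) / d)^2, with d the standardized effect size.
Z_ALPHA = 1.96   # two-sided alpha = 0.05
Z_BETA = 0.84    # power = 0.80

def n_per_arm(d: float) -> float:
    """Approximate required sample size per treatment arm."""
    return 2 * ((Z_ALPHA + Z_BETA) / d) ** 2

# A proxy outcome with a moderate effect vs. a welfare outcome with a
# smaller one: the welfare outcome needs four times the sample.
print(round(n_per_arm(0.20)))  # proxy effect, d = 0.20: ~392 per arm
print(round(n_per_arm(0.10)))  # welfare effect, d = 0.10: ~1568 per arm
```

This inverse-square relationship is why "measure the welfare outcome" is a budget decision as much as a methodological one.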
Takeaway: Choosing outcomes that genuinely track welfare—consumption, assets, subjective wellbeing—costs more and takes longer than measuring convenient proxies. But an evaluation that rigorously measures the wrong thing produces evidence worth less than no evidence at all.
The outcome measurement problem in development evaluation isn't primarily technical—it's conceptual. We have the tools to measure rigorously. What we lack is sufficient discipline in asking what to measure rigorously. The gap between proxy outcomes and welfare outcomes isn't a minor calibration issue; it's a potential inversion of our entire understanding of what works.
Closing that gap requires changes at every level: researchers who pre-commit to welfare-relevant outcomes, funders who finance the longer timelines and larger samples those outcomes demand, and journals that value null results on meaningful outcomes over significant results on convenient ones. The incentive structure must reward answering the right question, not just answering a question cleanly.
Development evaluation at its best is a tool for learning what actually improves human lives. That ambition is undermined every time we substitute a measurable proxy for a meaningful outcome. The evidence base we build is only as valuable as the outcomes it's built on.