A village health program in Bangladesh shows remarkable results. Child mortality drops, vaccination rates soar, and the data looks bulletproof. International funders celebrate. The government commits to national rollout. Three years later, the same metrics barely budge at scale.
This pattern repeats across development work with disturbing regularity. Promising pilots become disappointing national programs. Rigorous evidence from controlled trials fails to translate into real-world impact. The interventions that worked so well in the study seem to lose their potency when they leave the laboratory conditions of the pilot.
The gap between pilot success and scale-up disappointment isn't random bad luck or poor execution. It reflects systematic forces that transform interventions as they expand. Understanding these forces—implementation intensity loss, context dependence, and organizational capacity gaps—reveals why scaling development programs requires fundamentally different thinking than designing effective pilots.
Implementation Intensity Loss
Pilots run hot. They concentrate resources, attention, and talent on a small number of beneficiaries. A pilot serving 500 households might have dedicated staff visiting weekly, supervisors checking quality daily, and researchers analyzing every data point. This intensity produces results.
When the same program scales to 50,000 households, the math becomes brutal. The budget per beneficiary drops. One community health worker now covers ten villages instead of one. Supervision becomes sporadic. The feedback loops that caught problems early in the pilot stretch thin or break entirely.
Consider what happens to training. Pilot programs often invest heavily in developing frontline workers—weeks of instruction, ongoing mentorship, refresher courses. At scale, training compresses into days. The people who trained pilot staff become trainers of trainers, and fidelity degrades at each remove. The third-generation trainees bear little resemblance to the original cadre.
The attention itself matters independently of resources. Pilot beneficiaries know they're being watched, measured, and studied. Staff know their performance reflects on a promising innovation. This Hawthorne effect disappears at scale, replaced by the anonymity of just another government program. The intangible intensity that animated the pilot evaporates.
Takeaway: What works in a pilot often works partly because it's a pilot—the concentrated attention, resources, and motivation that make small-scale interventions successful are themselves part of the treatment, not just the delivery mechanism.
Context Dependence
Pilots happen somewhere specific. They emerge from particular relationships, respond to local conditions, and fit cultural contexts that researchers may not fully recognize or document. The success includes invisible factors that travel poorly.
A conditional cash transfer program works brilliantly in one region because local banks already have branches in rural areas. Expanding to a neighboring state with poor banking infrastructure requires entirely different delivery mechanisms. The intervention was never just about the money—it was about the money plus the existing financial ecosystem.
Human relationships embed themselves in successful pilots in ways that resist replication. The charismatic nurse who built trust with skeptical families, the community leader whose endorsement opened doors, the local government official who cut through bureaucratic obstacles—these people don't scale. Their equivalents elsewhere may not exist or may not be engaged.
Optimization compounds the problem. Pilots often iterate and adjust during implementation, fine-tuning the intervention to local conditions without documenting every adaptation. The final pilot design represents a solution optimized for that place and population. Lifting it wholesale to different contexts carries over the specificity while losing the optimization process that made it work.
Takeaway: Successful pilots are often successful marriages between an intervention and a specific context—separating the intervention from that context and expecting the same results elsewhere ignores that the context was part of what worked.
Organizational Capacity Gaps
Pilots typically run through NGOs, research institutions, or specialized implementation units. These organizations select for mission alignment, pay competitive wages, and maintain focused mandates. They can fire underperformers, adjust protocols rapidly, and operate outside civil service constraints.
Government systems at scale work differently. Staff have permanent tenure regardless of performance. Salary structures lag private alternatives, selecting against the most capable candidates. Procurement rules slow adaptation. Political pressures distort priorities. The same program design encounters an entirely different implementation environment.
The gap shows up in supervision. NGO supervisors often hold deep commitment to program outcomes and authority to enforce standards. Government supervisors navigate complex hierarchies, manage relationships with politically connected staff, and balance multiple programs simultaneously. The quality assurance that kept pilot implementation tight dissolves into bureaucratic routine.
Information systems illustrate the challenge starkly. Pilot programs often develop sophisticated monitoring—real-time data collection, rapid analysis, feedback to frontline workers. Government systems process information slowly through multiple levels. By the time problems surface in official reports, they've been ongoing for months. The responsiveness that characterized the pilot gives way to the inertia of large bureaucracies.
Takeaway: Scaling through government systems means accepting that the implementing organization fundamentally changes—either the intervention must be robust enough to work despite weaker implementation, or capacity-building must become as important as the program itself.
These three forces—intensity loss, context dependence, and capacity gaps—don't operate independently. They compound each other. A program losing implementation intensity becomes more vulnerable to contextual misfits, while weak organizational capacity accelerates both problems.
The implication isn't that scaling should be abandoned. It's that scale-up requires different design principles than pilot success. Interventions need slack built in for lower-intensity implementation. They need explicit attention to contextual assumptions. They need realistic assessment of implementer capacity.
Evidence from pilots tells us what can work under favorable conditions. It doesn't tell us what will work at scale. That question requires separate evidence, separate thinking, and a healthy skepticism toward the assumption that success travels.