The Scaling Problem: Why Behavioral Interventions Shrink in the Real World

Behavioral interventions consistently show smaller effects when scaled from controlled trials to real-world implementation, a phenomenon known as the voltage drop.

Key mechanisms include selection bias in pilots, implementation fidelity decay, context dependence, and general equilibrium effects.

Interventions that change defaults, modify choice architecture, and align with implementer incentives tend to survive scaling best.

Programs relying on charismatic delivery, intensive coaching, or favorable local conditions show the steepest declines.

Designing for scalability requires testing with average participants, building fidelity supports, stress-testing contexts, and specifying which components must be preserved versus adapted.

A behavioral intervention shows a 30% improvement in a randomized controlled trial. Excited policymakers fund a national rollout. Two years later, the effect has dwindled to 5%—or vanished entirely. This pattern is so common it has a name: the voltage drop.

The phenomenon isn't a failure of behavioral science. It's a predictable consequence of moving from carefully controlled conditions into the messy realities of clinics, classrooms, workplaces, and households. The question isn't whether interventions will lose power at scale, but why—and what we can do about it.

Understanding the voltage drop matters because the gap between efficacy and effectiveness is where most behavior change initiatives die. For practitioners designing programs, funders evaluating proposals, and researchers building evidence bases, the scaling problem deserves the same rigor we apply to the interventions themselves.

Voltage Drop Explained

Economist John List has identified several mechanisms that systematically erode intervention effects during scaling. The first is selection bias in pilot studies. Early trials often recruit motivated participants and enthusiastic implementers, people who self-selected into testing something new. When the program reaches a broader, less motivated population, the average effect falls.
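The arithmetic behind the selection mechanism can be sketched in a few lines. The population shares and effect sizes below are hypothetical, chosen only to illustrate the logic, and the two-group split is a simplifying assumption rather than a model taken from List's work.

```python
def average_effect(groups):
    """Population-weighted average effect.

    groups: list of (share_of_population, effect_size) pairs.
    Shares must sum to 1.
    """
    total_share = sum(share for share, _ in groups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * effect for share, effect in groups)


# Hypothetical numbers: motivated volunteers respond strongly (0.30),
# typical participants respond weakly (0.10).
pilot = [(0.8, 0.30), (0.2, 0.10)]     # pilot sample is mostly volunteers
at_scale = [(0.2, 0.30), (0.8, 0.10)]  # full population is mostly typical

pilot_effect = average_effect(pilot)      # about 0.26
scaled_effect = average_effect(at_scale)  # about 0.14
print(f"pilot: {pilot_effect:.2f}, at scale: {scaled_effect:.2f}")
```

Nothing about the intervention changed between the two estimates. Only the mix of people receiving it did, and that alone cuts the measured effect nearly in half under these assumed numbers.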

The second mechanism is implementation fidelity decay. In a pilot, researchers monitor delivery closely. Facilitators are trained extensively, materials are pristine, and protocols are followed. At scale, training compresses, supervision thins, and local adaptations accumulate. The intervention being delivered in year three often bears only passing resemblance to the one that was tested.

Context dependence is the third culprit. Many interventions work because of specific conditions in the pilot site: a charismatic leader, a particular organizational culture, complementary services nearby. These conditions don't replicate uniformly across thousands of sites. What looked like an intervention effect was partly an environment effect.

Finally, there are general equilibrium effects. An intervention that helps a few job seekers find work may have little effect when applied to everyone competing for the same limited jobs. Gains that look additive in a small trial often hit diminishing returns or market saturation once everyone receives the intervention.
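The job-seeker example can be made concrete with a toy model. This is a minimal sketch under strong assumptions: the number of jobs is fixed, each seeker's chance of being hired is proportional to a weight, and the intervention simply doubles that weight. The function name and all numbers are hypothetical.

```python
def job_finding_rates(n_seekers, n_jobs, n_treated, boost):
    """Expected job-finding rates (treated, untreated) with a fixed job pool.

    Each seeker's chance is proportional to a weight: `boost` for treated
    seekers, 1.0 for everyone else. This proportional rule is a rough
    approximation that holds when rates stay well below 1.
    """
    total_weight = n_treated * boost + (n_seekers - n_treated) * 1.0
    per_weight = n_jobs / total_weight
    treated_rate = min(1.0, boost * per_weight)
    untreated_rate = min(1.0, per_weight)
    return treated_rate, untreated_rate


baseline_rate = 100 / 1000  # 100 jobs, 1000 seekers, nobody treated

# Pilot: only 10 of 1000 seekers get the intervention.
pilot_treated, pilot_untreated = job_finding_rates(1000, 100, 10, 2.0)

# Full rollout: all 1000 seekers get it, competing for the same 100 jobs.
scaled_treated, _ = job_finding_rates(1000, 100, 1000, 2.0)

print(f"pilot gain:  {pilot_treated - pilot_untreated:.3f}")
print(f"scaled gain: {scaled_treated - baseline_rate:.3f}")
```

In the pilot, treated seekers nearly double their chances relative to the comparison group. When everyone is treated, the advantage cancels out and the job-finding rate returns to the baseline, because the intervention reshuffled who got the jobs without creating any new ones.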

Takeaway
A pilot study measures what an intervention can do under ideal conditions, not what it will do under typical ones. Treat efficacy estimates as upper bounds, not expectations.

What Survives Scaling

Not all interventions degrade equally. Meta-analytic reviews across health, education, and policy domains reveal patterns in which features tend to maintain their effects at scale. Simplicity consistently emerges as protective. Interventions requiring few steps, minimal training, and limited behavioral judgment from implementers travel better than complex multi-component programs.

Default changes and choice architecture modifications also scale relatively well. Automatic enrollment in retirement plans, opt-out organ donation systems, and reordered cafeteria layouts work because they don't depend on sustained human delivery. The structural change itself does the work, regardless of who is administering it.

Interventions that align with existing incentives show greater durability. When the behavior change benefits the implementer as much as the target population, fidelity is naturally maintained. Programs that require teachers, doctors, or managers to work harder for someone else's outcomes tend to erode quickly.

Conversely, interventions relying on messenger effects, intensive personal interaction, or precise behavioral coaching show the steepest voltage drops. The very features that made them powerful in trials—a skilled counselor, a trusted authority figure—become the bottleneck at scale. What worked because of a person rarely survives without that person.

Takeaway
The most scalable interventions change the environment rather than the individual. Ask not what behavior you want to produce, but what structure would make that behavior easier than the alternative.

Design for Scalability

Anticipating the voltage drop changes how interventions should be designed from the outset. The first principle is to test with the people you will actually reach at scale, not the ones who are easiest to recruit. Pilot studies should deliberately enroll average participants and typical implementers, not the most enthusiastic ones. The resulting estimate will likely be smaller, but it is closer to what you'll actually achieve.

Second, build in fidelity supports that don't require expertise. Checklists, digital prompts, and standardized materials reduce the variability that erodes effects. The goal is to make correct implementation easier than incorrect implementation. If the program only works when delivered by someone exceptional, it won't work at scale.

Third, stress-test for context. Before scaling, deliberately implement the intervention in sites that lack the favorable conditions of the original pilot. If effects evaporate without a particular type of leadership or local infrastructure, you've identified a boundary condition that must be addressed—or accepted as a scope limitation.

Finally, plan for adaptation rather than fighting it. Local modifications are inevitable. Identify which components are core (must be preserved) versus peripheral (can be adapted) and communicate this clearly. Interventions that specify their active ingredients survive translation; those that demand wholesale replication don't.
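One way to make the core-versus-peripheral distinction explicit is to write it down as data and check each site's delivery against it. The sketch below is illustrative only: the class, the function, and the component names are all hypothetical and do not come from any established fidelity framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterventionSpec:
    core: frozenset        # components that must be delivered as tested
    peripheral: frozenset  # components sites may adapt or drop


def review_adaptation(spec, delivered):
    """Compare what a site delivers against the spec.

    Returns (missing_core, unlisted): core components the site dropped,
    and local additions that the spec does not mention.
    """
    delivered = frozenset(delivered)
    missing_core = sorted(spec.core - delivered)
    unlisted = sorted(delivered - spec.core - spec.peripheral)
    return missing_core, unlisted


# Hypothetical spec for an automatic-enrollment program.
spec = InterventionSpec(
    core=frozenset({"automatic_enrollment", "opt_out_form"}),
    peripheral=frozenset({"welcome_letter", "branding"}),
)

# Hypothetical site: kept enrollment, dropped the opt-out form,
# and added a workshop of its own.
site_delivery = {"automatic_enrollment", "welcome_letter", "local_workshop"}

missing_core, unlisted = review_adaptation(spec, site_delivery)
print("missing core components:", missing_core)
print("unlisted local additions:", unlisted)
```

A dropped core component is a fidelity problem to fix. An unlisted local addition is not necessarily a problem, but it is worth recording so that later differences between sites can be traced back to it.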

Takeaway
Scalability isn't something you add after a successful trial. It's a property you build in from the first study, by designing for the conditions you'll actually face.

The voltage drop isn't a verdict against behavioral science—it's a feature of operating in complex human systems. The interventions that change millions of lives are rarely the most powerful ones in trials. They are the ones built to survive contact with reality.

For practitioners, this means evaluating evidence with a scaling lens. Ask not just whether something worked, but where, how, and under what conditions. A modest effect that holds at scale is worth more than a dramatic effect that doesn't.

The next generation of behavioral interventions will be judged not by their peak voltage but by what current still flows when the lights are turned on across an entire system.