The randomized controlled trial stands as development economics' most powerful tool for establishing causality. When executed properly, random assignment ensures that treatment and control groups differ only by chance at baseline, allowing us to attribute any subsequent differences to the intervention itself. This methodological elegance has transformed how we understand program effectiveness.

But the field reality is far messier than the textbook presentation suggests. Randomization can fail in ways that are subtle, pervasive, and often invisible to researchers who don't know what to look for. When your control group stops being a valid counterfactual, your entire impact estimate becomes meaningless—or worse, systematically biased in directions that lead to incorrect policy conclusions.

The uncomfortable truth is that many published impact evaluations contain randomization failures that went undetected or unreported. Understanding how randomization breaks down isn't just methodological housekeeping—it's essential for anyone designing, implementing, or interpreting experimental evidence in development settings. The integrity of evidence-based policy depends on our ability to diagnose these problems before they contaminate our conclusions.

Contamination Mechanisms: How Treatment and Control Groups Blur

The assumption underlying experimental estimates is that assigned treatment status corresponds to the conditions each group actually experiences. Contamination occurs when this separation breaks down: control group members receive some version of the treatment, or treatment group members fail to receive it. The technical term is noncompliance, but the practical manifestations are endlessly varied.

Geographic spillovers represent the most common contamination pathway. In education interventions, students in control schools may transfer to treatment schools. In agricultural programs, farmers share improved seeds or techniques with neighbors across village boundaries. Health information spreads through social networks regardless of experimental assignment. Any intervention that works through knowledge or behavior change is inherently difficult to contain.

Implementer errors create contamination at scale. Field staff may misunderstand assignment protocols, include ineligible households in treatment, or exclude assigned beneficiaries. In one microfinance evaluation I reviewed, loan officers had reassigned nearly 15% of participants based on their own judgment about creditworthiness—completely invalidating the randomization. The intentions were good; the methodological damage was severe.

Market-level effects contaminate experiments in ways researchers often miss entirely. A cash transfer program may raise local prices, affecting control households. A job training program may increase competition for limited positions, displacing control group members who would otherwise have been hired. Your treatment effect estimate captures the net impact of helping some people while potentially harming others in the same labor market.

The instrumental variables approach, which uses random assignment as an instrument for actual treatment receipt to estimate the effect of treatment on the treated rather than the intent-to-treat effect, can address some contamination. But it requires additional assumptions about the contamination process and typically increases standard errors substantially. Prevention through careful design beats statistical correction every time.
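
To make that distinction concrete, here is a minimal sketch of the Wald form of the IV estimator in Python: the intent-to-treat difference scaled by the gap in actual take-up between arms. The DataFrame layout and column names (`assigned`, `takeup`, `outcome`) are illustrative placeholders, not drawn from any particular study.

```python
import numpy as np
import pandas as pd

def itt_and_late(df: pd.DataFrame) -> dict:
    """Compare the intent-to-treat estimate with the Wald (IV) estimate,
    using random assignment as the instrument for actual treatment receipt.

    Assumes columns: 'assigned' (0/1 random assignment),
    'takeup' (0/1 actual receipt), 'outcome' (follow-up outcome).
    """
    assigned = df[df["assigned"] == 1]
    control = df[df["assigned"] == 0]

    # Intent-to-treat: difference in mean outcomes by assignment.
    itt = assigned["outcome"].mean() - control["outcome"].mean()

    # First stage: how much assignment shifted actual take-up
    # (below 1 whenever there is noncompliance or contamination).
    first_stage = assigned["takeup"].mean() - control["takeup"].mean()

    # Wald / IV estimate: the ITT scaled up by the take-up gap.
    late = itt / first_stage if first_stage != 0 else np.nan

    return {"itt": itt, "first_stage": first_stage, "late": late}
```

The last line is also where the standard-error cost shows up: a weak first stage divides the ITT by a small, noisily estimated number.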

Takeaway

Contamination doesn't just add noise—it systematically biases estimates toward zero by making treatment and control conditions more similar than intended.

Differential Attrition: The Silent Killer of Internal Validity

Randomization creates comparable groups at baseline. Differential attrition destroys that comparability over time. If treatment and control groups lose different types of participants at different rates, the surviving samples are no longer comparable—and your impact estimate reflects selection effects rather than program effects.

The mechanism is straightforward but insidious. Imagine a job training program where discouraged participants stop responding to follow-up surveys. If the program is effective, treatment group members may be more likely to stay in the sample, but those who stay are disproportionately the type who would have succeeded anyway, leaving a positively selected treatment sample. Alternatively, successful treatment graduates may exit your sample because they have moved for new jobs, while unsuccessful control members remain visible in follow-up surveys.

Attrition rates alone don't tell you whether you have a problem. Balanced attrition—equal dropout rates in treatment and control—can still bias estimates if different types of people are leaving each group. A training program might lose 20% from both arms, but if treatment loses the most motivated participants to employment while control loses the least motivated to discouragement, your remaining samples are fundamentally different.

Detecting differential attrition requires examining not just overall rates but the characteristics of attriters. Compare baseline characteristics of those lost to follow-up against those retained, separately by treatment status. If treatment attriters look systematically different from control attriters, you have a problem that no statistical adjustment can fully resolve.
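
One way to operationalize that check is sketched below: compute attrition rates by arm, then compare the baseline characteristics of treatment attriters against control attriters. The `treatment`, `attrited`, and covariate column names are hypothetical, assumed for illustration.

```python
import pandas as pd
from scipy import stats

def attrition_diagnostics(df: pd.DataFrame, baseline_vars: list[str]) -> pd.DataFrame:
    """Report attrition rates by arm, then compare baseline characteristics
    of attriters in the treatment arm against attriters in the control arm.

    Assumes columns: 'treatment' (0/1 assignment) and 'attrited'
    (1 if lost to follow-up), plus the listed baseline covariates.
    """
    # Overall attrition rates by arm: a necessary but insufficient check.
    rates = df.groupby("treatment")["attrited"].mean()
    print("Attrition rate, control:  ", round(rates.loc[0], 3))
    print("Attrition rate, treatment:", round(rates.loc[1], 3))

    # Who is leaving? Compare treatment attriters to control attriters.
    treat_att = df[(df["treatment"] == 1) & (df["attrited"] == 1)]
    ctrl_att = df[(df["treatment"] == 0) & (df["attrited"] == 1)]

    rows = []
    for var in baseline_vars:
        _, p_val = stats.ttest_ind(
            treat_att[var].dropna(), ctrl_att[var].dropna(), equal_var=False
        )
        rows.append({
            "variable": var,
            "mean_treat_attriters": treat_att[var].mean(),
            "mean_ctrl_attriters": ctrl_att[var].mean(),
            "p_value": p_val,
        })
    return pd.DataFrame(rows)
```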

Bounding exercises can establish the range of possible true effects given attrition. The most conservative approach assumes all treatment attriters would have had the worst possible outcomes and all control attriters the best. If your effect survives this extreme assumption, it's robust. In practice, these bounds are often too wide to be informative—which itself tells you something important about what you can actually conclude from the data.
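
A minimal sketch of those extreme bounds, assuming the outcome is known to lie in a fixed range (say, a 0/1 employment indicator) and that attriters appear as missing outcomes; the column names are again placeholders.

```python
import pandas as pd

def extreme_attrition_bounds(df: pd.DataFrame, y_min: float, y_max: float) -> tuple[float, float]:
    """Worst-case bounds on the treatment effect under attrition, for an
    outcome known to lie in [y_min, y_max].

    Assumes columns: 'treatment' (0/1) and 'outcome' (NaN if attrited).
    """
    def arm_mean(arm: int, fill: float) -> float:
        y = df.loc[df["treatment"] == arm, "outcome"]
        return y.fillna(fill).mean()  # impute attriters at the chosen extreme

    # Lower bound: treatment attriters assigned the worst possible outcome,
    # control attriters the best.
    lower = arm_mean(1, y_min) - arm_mean(0, y_max)

    # Upper bound: the symmetric opposite assumption.
    upper = arm_mean(1, y_max) - arm_mean(0, y_min)

    return lower, upper
```

If the interval from `lower` to `upper` still excludes zero, the effect survives even this extreme assumption; if it spans most of the outcome range, the data cannot support a firm conclusion.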

Takeaway

When different types of people disappear from treatment and control groups, you're no longer comparing what randomization intended you to compare.

Balance Table Interpretation: Reading the Warning Signs

The balance table is your first diagnostic tool for assessing whether randomization succeeded. It compares baseline characteristics between treatment and control groups, testing whether observed differences could have arisen by chance. But knowing how to read a balance table—and knowing what it can't tell you—separates rigorous evaluators from naive ones.

Start with the obvious: examine p-values for each baseline characteristic. With proper randomization, roughly 5% of variables should show statistically significant differences at the 0.05 level purely by chance. If 20% of your variables are imbalanced, something has gone wrong. But don't stop at counting significant differences—examine the magnitudes. Small absolute differences in variables that strongly predict outcomes matter more than large differences in irrelevant characteristics.
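
A rough sketch of such a balance table in Python, reporting both a p-value and a standardized difference for each variable so that magnitudes are visible alongside significance. The `treatment` column and covariate list are assumed placeholders.

```python
import numpy as np
import pandas as pd
from scipy import stats

def balance_table(df: pd.DataFrame, baseline_vars: list[str]) -> pd.DataFrame:
    """Per-variable balance checks: means by arm, standardized difference,
    and a two-sample t-test p-value.

    Assumes a 0/1 'treatment' column plus the listed baseline covariates.
    """
    treat = df[df["treatment"] == 1]
    ctrl = df[df["treatment"] == 0]

    rows = []
    for var in baseline_vars:
        x1, x0 = treat[var].dropna(), ctrl[var].dropna()
        # Standardized difference: magnitude in pooled-SD units,
        # which, unlike the p-value, does not shrink with sample size.
        pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
        std_diff = (x1.mean() - x0.mean()) / pooled_sd if pooled_sd > 0 else np.nan
        _, p_val = stats.ttest_ind(x1, x0, equal_var=False)
        rows.append({
            "variable": var,
            "mean_treatment": x1.mean(),
            "mean_control": x0.mean(),
            "std_diff": std_diff,
            "p_value": p_val,
        })
    table = pd.DataFrame(rows)
    # Under clean randomization, expect roughly 5% of p-values below 0.05.
    print("Share of variables with p < 0.05:", (table["p_value"] < 0.05).mean())
    return table
```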

The joint F-test provides a more powerful diagnostic than examining variables individually. Regress treatment assignment on all baseline characteristics simultaneously and test whether they jointly predict assignment. Proper randomization should produce an insignificant F-statistic. A significant result indicates systematic selection that individual variable tests might miss.
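
A sketch of that joint test, using an ordinary least squares regression of assignment on the covariates (a linear probability model) and the regression's overall F-statistic; the column names are again placeholders.

```python
import pandas as pd
import statsmodels.api as sm

def joint_balance_test(df: pd.DataFrame, baseline_vars: list[str]):
    """Regress treatment assignment on all baseline covariates at once
    and test whether they jointly predict assignment.

    Assumes a 0/1 'treatment' column plus the listed baseline covariates.
    """
    data = df[["treatment"] + baseline_vars].dropna()
    X = sm.add_constant(data[baseline_vars])
    res = sm.OLS(data["treatment"], X).fit()

    # Under clean randomization the covariates should have no joint
    # predictive power: expect a small F-statistic and a large p-value.
    print(f"F-statistic: {res.fvalue:.2f}, p-value: {res.f_pvalue:.3f}")
    return res
```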

Watch for suspicious patterns that suggest manipulation or implementation failures. Perfect balance across all variables is actually a red flag—it suggests researchers may have re-randomized until they got a "good" draw or selectively dropped problematic observations. Clustering of imbalances around particular characteristics may indicate that randomization broke down for specific subgroups or implementation sites.

Missing baseline data complicates balance assessment in ways researchers often underappreciate. If certain variables have substantial missingness, balance tests on available data may be misleading. The households for whom you have baseline data may not represent the full randomized sample. Document missingness rates by treatment status and consider whether the data collection process itself may have been influenced by treatment assignment.
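
One way to document this, sketched under the same assumed column layout as above: tabulate missingness rates by arm for each baseline variable and flag differences.

```python
import pandas as pd
from scipy import stats

def missingness_by_arm(df: pd.DataFrame, baseline_vars: list[str]) -> pd.DataFrame:
    """Baseline missingness rates by assignment, with a test for
    differential missingness on each variable.

    Assumes a 0/1 'treatment' column plus the listed baseline covariates.
    """
    rows = []
    for var in baseline_vars:
        missing = df[var].isna().astype(int)
        miss_t = missing[df["treatment"] == 1]
        miss_c = missing[df["treatment"] == 0]
        # Two-sample test on the missingness indicator itself:
        # differential missingness is a warning sign in its own right.
        _, p_val = stats.ttest_ind(miss_t, miss_c, equal_var=False)
        rows.append({
            "variable": var,
            "missing_rate_treatment": miss_t.mean(),
            "missing_rate_control": miss_c.mean(),
            "p_value": p_val,
        })
    return pd.DataFrame(rows)
```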

Takeaway

A balance table tells you whether randomization worked as planned—but only if you know which patterns indicate genuine failures versus expected statistical noise.

Randomization failures don't announce themselves. They hide in implementation details, reveal themselves only through careful diagnostic work, and often remain invisible in published papers that report balance tables without adequate scrutiny. The burden falls on evaluators to actively hunt for problems rather than assume their experiment executed as designed.

The implications extend beyond individual studies. When randomization failures go undetected, they contaminate the evidence base that informs policy. Programs get scaled based on inflated effect sizes, or abandoned based on attenuated estimates that reflect methodological problems rather than genuine ineffectiveness.

Building a culture of rigorous self-examination—where researchers treat potential randomization failures as likely rather than exceptional—is essential for evidence-based development to deliver on its promise. Your control group is only a control if you've verified it at every stage.