Compliance and Intent-to-Treat: What Your Estimates Actually Mean

Intent-to-treat estimates capture the effect of offering a program, while treatment-on-treated estimates isolate the effect among actual recipients.

ITT is often the policy-relevant parameter because real-world programs cannot enforce perfect compliance.

Instrumental variable approaches recover local average treatment effects for compliers, who may differ systematically from the broader target population.

Best practice requires reporting both estimates alongside first-stage compliance and complier characteristics.

The gap between ITT and TOT is itself diagnostic, revealing whether implementation or intervention design is the binding constraint.

Imagine you randomize 2,000 households into a deworming program. Half are offered free treatment; half are not. Six months later, you measure school attendance. But here's the wrinkle that haunts every applied economist: only 60 percent of the treatment group actually took the pills, and 5 percent of the control group obtained them through other channels. What, exactly, are you estimating when you compare outcomes across your original assignment?

This question—deceptively simple—sits at the heart of how we interpret experimental evidence in development economics. The choice between intent-to-treat (ITT) and treatment-on-treated (TOT) estimates is not merely technical bookkeeping. It reflects fundamentally different policy questions, applies to different populations, and rests on different identifying assumptions.

Yet too often, papers report one estimate without sufficient justification, or worse, conflate the two in policy recommendations. A program with strong TOT effects but weak ITT effects may be ineffective at scale precisely because compliance is the binding constraint. Understanding this distinction—and communicating it transparently—is essential for translating rigorous evaluation into rigorous policy.

ITT as the Policy-Relevant Estimand

The intent-to-treat estimate compares outcomes between everyone assigned to treatment and everyone assigned to control, regardless of whether they actually received the intervention. This sounds like a methodological compromise—why dilute your estimate with non-compliers?—but it is often precisely the parameter policymakers should care about.

Consider the deworming example. A government deciding whether to fund a national program does not get to choose only those households that will perfectly comply. It must offer the program and accept that uptake will be partial. The ITT estimate captures exactly this real-world counterfactual: the expected impact of making the program available, including the friction, forgetfulness, and skepticism that accompany any intervention.

This is why ITT is often treated as the policy-relevant parameter, a framing associated with Angrist and Pischke. It internalizes implementation realities that often determine whether a program works at scale. A bed net distribution scheme that achieves spectacular health outcomes among the 30 percent who hang their nets correctly delivers, from a budgetary standpoint, only about 30 percent of that benefit per household it pays to reach.

ITT also has the virtue of preserving the experiment's integrity. Random assignment makes the two arms comparable in expectation, so a comparison by assignment is unbiased. Conditioning on actual treatment receipt, which is an endogenous decision, reintroduces selection bias unless assignment is used as an instrument for receipt. Reporting ITT first respects the design that generated the evidence.
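A minimal sketch of this point, using stylized, noise-free data for the deworming example. The take-up shares (60 percent if offered, 5 percent if not) come from the example above; the attendance baselines, the 5-day effect, and all names in the code are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Stylized data: 1,000 households per arm, no sampling noise.
# Each tuple: (latent type, count per arm, hypothetical baseline attendance days)
TYPES = [("always_taker", 50, 82.0), ("complier", 550, 80.0), ("never_taker", 400, 75.0)]
PILL_EFFECT = 5.0  # hypothetical gain in attendance days from taking the pills


def build_households():
    """Return a list of (assigned, took_pills, attendance) records."""
    rows = []
    for assigned in (1, 0):
        for kind, count, baseline in TYPES:
            # Always-takers take pills in both arms; compliers only if offered.
            took = int(kind == "always_taker" or (kind == "complier" and assigned == 1))
            rows += [(assigned, took, baseline + PILL_EFFECT * took)] * count
    return rows


rows = build_households()

# ITT: compare by original assignment, ignoring what households actually did.
itt = mean(y for z, d, y in rows if z == 1) - mean(y for z, d, y in rows if z == 0)

# Naive "as-treated" contrast: compare by pills actually taken.
as_treated = mean(y for z, d, y in rows if d == 1) - mean(y for z, d, y in rows if d == 0)

print(f"ITT: {itt:.2f} days")                # ITT: 2.75 days
print(f"As-treated: {as_treated:.2f} days")  # As-treated: 8.27 days
```

In this constructed example the as-treated contrast (8.27 days) exceeds the 5-day effect built into the data, because the households that skip the pills were given lower baseline attendance. The ITT (2.75 days) is smaller than 5 but is an unbiased answer to a different question: what offering the program does.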

The practical implication is clear: when external validity to scaled implementation matters, ITT is not a second-best estimate. It is the estimate. The question is not whether the program works for those who use it, but whether offering it changes outcomes in the population.

Takeaway
Effectiveness at scale is roughly the effect among households induced to take up the program multiplied by the share you can actually induce. Ignoring the second term turns evaluation into wishful thinking.
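Written out, and assuming the standard instrumental-variable conditions discussed in the next section, the relationship looks like this. The symbols Z (assignment), D (take-up), and Y (outcome) are notation introduced here for illustration.

```latex
\[
\underbrace{E[Y \mid Z=1] - E[Y \mid Z=0]}_{\text{ITT}}
  = \underbrace{\bigl(E[D \mid Z=1] - E[D \mid Z=0]\bigr)}_{\text{first stage}}
    \times \text{LATE}
\]
% Deworming example: first stage = 0.60 - 0.05 = 0.55,
% so the ITT is 55 percent of the average effect among compliers.
```

Note that the "efficacy" term is the average effect among compliers, not the effect the program would have if everyone were treated.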

Complier Average Causal Effects and Their Limits

When researchers want to estimate the effect of treatment received rather than treatment offered, the standard tool is two-stage least squares using random assignment as an instrument for actual uptake. Under the LATE assumptions—relevance, exclusion, monotonicity, and independence—this recovers the local average treatment effect: the impact of treatment on those whose participation status was changed by the random assignment.

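These individuals are compliers. They took the pills because they were offered, and would not have obtained them otherwise. The LATE is a real causal parameter for a real subpopulation. It is never smaller in magnitude than the ITT and is often substantially larger, since it strips out the dilution from never-takers and always-takers, whose treatment status does not respond to assignment. When no one in the control group can obtain treatment, everyone treated is a complier and the LATE coincides with the TOT; when the control group has access through other channels, as in the deworming example, the two parameters differ.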
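With a single binary instrument, a binary treatment, and no covariates, the two-stage least squares estimate reduces to the Wald ratio: the ITT on the outcome divided by the ITT on take-up. A minimal sketch follows; the function name and the attendance means are hypothetical, while the take-up rates are those of the deworming example.

```python
def wald_late(y_offered, y_control, takeup_offered, takeup_control):
    """Wald (IV) estimate: ITT on the outcome divided by ITT on take-up."""
    first_stage = takeup_offered - takeup_control
    if abs(first_stage) < 1e-8:
        # Relevance fails: assignment did not shift take-up, so the ratio is undefined.
        raise ValueError("No first stage: assignment did not shift take-up.")
    itt = y_offered - y_control
    return itt / first_stage


# Hypothetical arm-level attendance means (days); take-up rates from the example.
late = wald_late(y_offered=81.10, y_control=78.35,
                 takeup_offered=0.60, takeup_control=0.05)
print(f"LATE: {late:.2f} days per complier")  # LATE: 5.00 days per complier
```

The point estimate is only the arithmetic. In practice one would also want standard errors and a check on first-stage strength, which a regression-based IV routine provides.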

But here is the subtlety that bedevils policy translation: compliers are not necessarily representative of the target population. They may be more health-conscious, more trusting of authorities, more accessible to enumerators, or simply less time-constrained. The LATE describes treatment effects for a self-selected group whose members we cannot individually identify, although their average characteristics can be estimated.

This matters enormously when scaling. If a program achieves high compliance among an enthusiastic minority but cannot reach the indifferent majority, the LATE may overstate population-level impact even after multiplying by realistic uptake rates. The compliers in your evaluation may be precisely the households that are easiest to reach, while a scaled program must also win over households that resemble your never-takers.

Methods like marginal treatment effect estimation, complier characteristic analysis (following Abadie and others), and sensitivity bounds can illuminate who the compliers are. Reporting these alongside the LATE itself is essential for honest extrapolation.
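One simple version of complier characteristic analysis can be sketched as follows. It is a ratio estimator in the spirit of that literature, not an implementation of any specific published procedure, and it relies on random assignment and monotonicity (no defiers). The covariate x, the type shares, and all names below are hypothetical.

```python
def complier_mean(rows):
    """Estimate the complier mean of covariate x from (z, d, x) records.

    Uses (E[x*d | z=1] - E[x*d | z=0]) / (E[d | z=1] - E[d | z=0]).
    Always-takers cancel in the numerator, leaving the compliers' contribution.
    """
    offered = [(d, x) for z, d, x in rows if z == 1]
    control = [(d, x) for z, d, x in rows if z == 0]
    xd_gap = (sum(d * x for d, x in offered) / len(offered)
              - sum(d * x for d, x in control) / len(control))
    d_gap = (sum(d for d, _ in offered) / len(offered)
             - sum(d for d, _ in control) / len(control))
    return xd_gap / d_gap


# Stylized data: x is a hypothetical binary household characteristic.
# Each tuple: (latent type, count per arm, count with x = 1)
TYPES = [("always_taker", 50, 40), ("complier", 550, 330), ("never_taker", 400, 100)]
rows = []
for z in (1, 0):
    for kind, count, n_x1 in TYPES:
        d = int(kind == "always_taker" or (kind == "complier" and z == 1))
        rows += [(z, d, 1)] * n_x1 + [(z, d, 0)] * (count - n_x1)

overall = sum(x for _, _, x in rows) / len(rows)
print(f"Share with x = 1, all households: {overall:.2f}")         # 0.47
print(f"Share with x = 1, compliers: {complier_mean(rows):.2f}")  # 0.60
```

The estimator sees only assignment, take-up, and the covariate, never the latent type labels, yet it recovers the complier share built into the data (330 of 550). Comparing that figure with the population share indicates how unrepresentative the compliers are on that dimension.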

Takeaway
Local average treatment effects are local in a deeper sense than the acronym suggests—they describe a population defined by responsiveness to your specific encouragement, not by policy interest.

Reporting Standards for Transparent Interpretation

Best practice in experimental development economics now demands that papers report both ITT and compliance-adjusted estimates, alongside first-stage compliance rates and complier characteristics. This is not redundancy—each estimate answers a distinct question, and the gap between them is itself informative.

Under the LATE assumptions, the ratio of the ITT to the LATE equals the first-stage difference in take-up between arms, so the gap is a direct readout of compliance. A small ITT paired with a large LATE signals a compliance problem: the intervention works for those who take it up, but uptake is the binding constraint. The policy lever is delivery, not redesign. Conversely, an ITT close to the LATE indicates near-universal compliance, suggesting the intervention itself drives the result and external validity hinges on intervention content rather than implementation.

Authors should clearly state which estimand corresponds to which policy question. Phrases like "the program increased earnings by 12 percent" without qualification obscure whether this refers to offering, receiving, or some hybrid. Pre-analysis plans should specify the primary estimand and the rationale.

Compliance dynamics also deserve substantive discussion, not just a footnote. What share of the treatment group complied? What share of the control group accessed treatment through other channels (contamination)? How were these measured—self-report, administrative records, biomarkers? Each measurement choice shapes interpretation.

Finally, scaling discussions require explicit reasoning about whether compliance patterns will replicate. Pilot studies often achieve compliance through intensive monitoring that government implementation cannot match. An honest evaluation discusses this gap, perhaps presenting bounds under alternative compliance assumptions.
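One way to present such scenarios is a small table of projected per-household effects under different take-up rates. The sketch below holds the complier effect fixed across scenarios, which is a strong assumption: as discussed earlier, the households that comply at scale may differ from those that complied in the trial. The scenario names, rates, and the 5-day effect are hypothetical.

```python
def projected_itt(effect_on_compliers, takeup_if_offered, takeup_if_not_offered=0.0):
    """Project the per-household effect of offering the program.

    Assumes the complier effect is the same in every scenario (a strong assumption).
    """
    if not 0.0 <= takeup_if_not_offered <= takeup_if_offered <= 1.0:
        raise ValueError("Take-up rates must satisfy 0 <= not offered <= offered <= 1.")
    return effect_on_compliers * (takeup_if_offered - takeup_if_not_offered)


LATE_DAYS = 5.0  # hypothetical complier effect, in attendance days
scenarios = {
    "trial conditions": 0.60,     # take-up observed in the example
    "moderate monitoring": 0.45,  # hypothetical
    "minimal monitoring": 0.30,   # hypothetical
}
projections = {
    name: projected_itt(LATE_DAYS, rate, takeup_if_not_offered=0.05)
    for name, rate in scenarios.items()
}
for name, value in projections.items():
    print(f"{name}: {value:.2f} days per household offered")
```

Under these assumed inputs the projected effect falls from 2.75 to 2.00 to 1.25 days per household offered as take-up drops, which makes the dependence of any scale-up claim on compliance explicit.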

Takeaway
The discipline of reporting both ITT and TOT—and the gap between them—forces researchers to confront the difference between what an intervention can do and what it will do.

The intent-to-treat versus treatment-on-treated distinction is not a methodological footnote. It is the line between estimating what works in principle and what works in practice—and policy lives in the gap.

Researchers who report only TOT estimates risk overselling their interventions to policymakers who cannot replicate the compliance conditions of a well-managed trial. Those who report only ITT may understate genuine treatment efficacy and miss diagnostic information about implementation bottlenecks. Both estimates, properly contextualized, illuminate different facets of the same intervention.

The path forward is methodological humility paired with interpretive precision: report both, characterize compliers, discuss compliance dynamics substantively, and reason explicitly about how implementation conditions will translate to scale. Rigorous evidence demands rigorous communication—anything less squanders the scientific investment that experimental evaluation represents.