Development economists face a persistent challenge: we need causal evidence to guide policy, but randomized experiments aren't always feasible. Ethical constraints, political realities, or simple timing often mean we're working with data from programs already implemented. The question becomes whether we can extract credible causal estimates from observational data.
Difference-in-differences (DiD) represents one of our most powerful tools for this task. The method exploits a simple but profound insight: if we observe two groups over time—one that receives treatment and one that doesn't—we can difference out both time-invariant group characteristics and common temporal shocks. What remains, under the right conditions, approximates the causal effect we seek.
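In its simplest two-group, two-period form, the calculation is just four cell means. Here is a minimal sketch in Python using a toy pandas DataFrame with hypothetical column names (treated, post, outcome):

```python
import pandas as pd

# Hypothetical long-format data: one row per unit-period, with columns
# "treated" (1 = treatment group), "post" (1 = post-period), and "outcome".
df = pd.DataFrame({
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],
    "outcome": [10, 12, 18, 20, 11, 13, 14, 16],
})

# Mean outcome in each of the four group-by-period cells.
cell_means = df.groupby(["treated", "post"])["outcome"].mean()

# Difference-in-differences: change for the treated group minus
# change for the control group over the same window.
did = (cell_means.loc[(1, 1)] - cell_means.loc[(1, 0)]) \
    - (cell_means.loc[(0, 1)] - cell_means.loc[(0, 0)])

# In this toy example the estimate is (19 - 11) - (15 - 12) = 5.
print(f"2x2 DiD estimate: {did:.2f}")
```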
Yet DiD requires careful thinking about assumptions that are often violated in practice. Recent methodological advances have revealed that standard implementations can produce severely biased estimates, particularly when treatments roll out gradually across units. For development practitioners designing evaluations or interpreting existing studies, understanding these nuances isn't optional—it's essential for distinguishing credible evidence from statistical artifacts. This analysis walks through what DiD can and cannot tell us, and how to implement it rigorously.
Parallel Trends Assumption: The Foundation That Must Hold
The entire DiD edifice rests on one assumption: absent treatment, the treated and control groups would have followed parallel outcome trajectories. This is fundamentally untestable because we never observe the treated group's counterfactual path. Yet everything depends on its plausibility.
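Stated formally, in standard potential-outcomes notation (not notation used elsewhere in this piece), the assumption is that the expected change in untreated outcomes would have been the same in both groups:

```latex
% Parallel trends: absent treatment, the expected change in outcomes
% would have been identical for treated (D = 1) and control (D = 0) groups.
\[
  \mathbb{E}\!\left[\,Y_{i,\mathrm{post}}(0) - Y_{i,\mathrm{pre}}(0) \mid D_i = 1\,\right]
  \;=\;
  \mathbb{E}\!\left[\,Y_{i,\mathrm{post}}(0) - Y_{i,\mathrm{pre}}(0) \mid D_i = 0\,\right]
\]
```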
Consider evaluating a microcredit program that expanded to certain districts but not others. If the treatment districts were selected because they showed improving economic conditions—perhaps making program implementation easier—then comparing their post-program outcomes to control districts confounds program effects with pre-existing divergent trends. The DiD estimate attributes to the program what would have happened anyway.
Standard practice involves testing for parallel pre-treatment trends by examining whether outcomes moved together before intervention. This provides suggestive evidence but isn't definitive. Groups can share pre-trends yet diverge precisely when treatment occurs due to unobserved factors. Conversely, minor pre-trend differences might reflect noise rather than systematic divergence.
Development contexts present particular challenges. Governments rarely assign programs randomly across administrative units. Selection typically reflects capacity, need, or political considerations—all of which correlate with outcome trajectories. The credibility of any DiD analysis hinges on articulating why, in your specific context, the parallel trends assumption is defensible.
Several strategies strengthen plausibility assessments. Plotting raw outcome trends for treated and control groups before the intervention can reveal obvious divergences. Testing for parallel trends in other outcomes unlikely to be affected by treatment provides indirect evidence. Including group-specific linear trends can absorb some differential trajectory concerns, though this approach has its own limitations. Most importantly, institutional knowledge about program placement helps identify potential confounders that statistical tests cannot detect.
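As a sketch of the first of these strategies, the plot below assumes long-format data with hypothetical columns year, outcome, and a treated indicator:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_group_trends(df: pd.DataFrame, intervention_year: int) -> None:
    """Plot mean outcomes by treatment status over time.

    Expects long-format data with columns "year", "outcome", and
    "treated" (1 for eventually-treated districts, 0 for controls).
    """
    means = df.groupby(["treated", "year"])["outcome"].mean().unstack(level=0)
    means.plot(marker="o")
    plt.axvline(intervention_year, linestyle="--", color="gray",
                label="intervention")
    plt.ylabel("mean outcome")
    plt.legend(title="treated")
    plt.title("Raw outcome trends by treatment status")
    plt.show()
```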
Takeaway: Parallel trends cannot be proven, only falsified or made plausible through institutional knowledge and pre-trend analysis—the strength of your causal claim depends entirely on the strength of this argument.
Staggered Adoption Complications: When Standard Methods Fail
Much development program evaluation involves staggered rollouts—interventions adopted by different regions or groups at different times. Standard two-way fixed effects (TWFE) regression, the workhorse DiD implementation, appears naturally suited to this setting. Recent econometric research has revealed this intuition is dangerously wrong.
The problem emerges from how TWFE constructs its estimate. It creates a weighted average of all possible two-by-two DiD comparisons in the data, including comparisons where already-treated units serve as controls for newly-treated units. When treatment effects evolve over time—growing, shrinking, or changing direction—these problematic comparisons can receive negative weights, potentially yielding estimates with the wrong sign.
Consider a conditional cash transfer program rolled out to provinces over five years. Early-adopting provinces might show large initial effects that fade as implementation challenges emerge. Later-adopting provinces might show smaller initial effects but steeper growth as they learn from early implementers. TWFE mixes these heterogeneous effects in unpredictable ways, potentially concluding the program is harmful when it benefits all participants.
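A small simulation makes the mechanism concrete. The sketch below uses hypothetical provinces, years, and effect sizes: it builds a staggered rollout in which effects grow with time since adoption, then compares a static TWFE coefficient with the true average effect on the treated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical staggered rollout: 40 provinces observed over 10 years.
# Half adopt in year 3 ("early"), half in year 7 ("late"); no never-treated group.
units = np.arange(40)
years = np.arange(10)
adopt_year = np.where(units < 20, 3, 7)

rows = []
for u in units:
    for t in years:
        event_time = t - adopt_year[u]
        treated = int(event_time >= 0)
        # True effect grows with time since adoption (dynamic treatment effects).
        effect = 2.0 * (event_time + 1) if treated else 0.0
        rows.append({
            "unit": u,
            "year": t,
            "treated": treated,
            "true_effect": effect,
            "y": 1.0 * u + 0.5 * t + effect + rng.normal(scale=0.5),
        })
df = pd.DataFrame(rows)

# Benchmark: the true average effect across treated unit-years.
true_att = df.loc[df["treated"] == 1, "true_effect"].mean()

# Static two-way fixed effects regression (the standard TWFE implementation).
twfe = smf.ols("y ~ treated + C(unit) + C(year)", data=df).fit()

print(f"true average effect on the treated: {true_att:.2f}")
print(f"static TWFE estimate:               {twfe.params['treated']:.2f}")
# Because already-treated provinces serve as controls while their own effects
# are still growing, the TWFE coefficient falls well short of the true average
# effect in this design; with other effect paths it can even change sign.
```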
The solution requires decomposing the aggregate estimate or using recently developed estimators explicitly designed for staggered settings. Methods proposed by Callaway and Sant'Anna, Sun and Abraham, and de Chaisemartin and D'Haultfœuille provide consistent estimates under treatment effect heterogeneity by carefully controlling which comparisons enter the analysis.
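In applied work one would normally use the authors' own implementations (for example, the did package in R or csdid in Stata). The hand-rolled pandas sketch below is only meant to illustrate the core idea of group-time effects, assuming a hypothetical layout with columns unit, year, first_treat (0 for never-treated units), and y:

```python
import numpy as np
import pandas as pd

def group_time_att(df: pd.DataFrame, g: int, t: int) -> float:
    """ATT(g, t): average effect in year t for the cohort first treated in year g.

    Compares the change in mean outcomes from the last pre-treatment year
    (g - 1) to year t between cohort g and the never-treated units
    (first_treat == 0); this is the kind of comparison that
    heterogeneity-robust estimators restrict themselves to.
    """
    cohort = df[df["first_treat"] == g]
    never = df[df["first_treat"] == 0]

    def change(group: pd.DataFrame) -> float:
        pre = group.loc[group["year"] == g - 1, "y"].mean()
        post = group.loc[group["year"] == t, "y"].mean()
        return post - pre

    return change(cohort) - change(never)

def overall_att(df: pd.DataFrame) -> float:
    """Aggregate ATT(g, t) across cohorts and post-treatment years, weighting
    by cohort size, rather than letting one regression pick the weights."""
    n_years = df["year"].nunique()
    estimates, weights = [], []
    for g in sorted(df.loc[df["first_treat"] > 0, "first_treat"].unique()):
        cohort_size = (df["first_treat"] == g).sum() / n_years
        for t in range(int(g), int(df["year"].max()) + 1):
            estimates.append(group_time_att(df, g, t))
            weights.append(cohort_size)
    return float(np.average(estimates, weights=weights))
```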
For development practitioners, the implication is stark: any DiD study using TWFE with staggered adoption deserves scrutiny. Does the author address potential negative weighting? Are results robust to alternative estimators? When evaluating a portfolio of programs implemented over time—as bilateral and multilateral agencies routinely do—standard pooled analyses may substantially misrepresent aggregate impact.
Takeaway: Two-way fixed effects with staggered treatment can produce estimates of the wrong sign when effects vary over time—modern DiD requires estimators designed for this heterogeneity.
Practical Implementation: From Theory to Credible Estimates
Implementing DiD rigorously requires a sequence of decisions that shape estimate credibility. Each choice involves tradeoffs, and transparency about these choices distinguishes careful evaluation from mechanical regression.
The choice of control group often determines whether an analysis succeeds or fails. Ideal controls share characteristics predicting outcome trends but differ only in treatment assignment. Geographic neighbors frequently serve well, sharing economic conditions and policy environments. However, spatial spillovers—treatment effects diffusing to nearby areas—can bias estimates toward zero. Matching on pre-treatment characteristics or outcome levels can improve comparability but raises questions about which characteristics matter.
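One simple way to operationalize matching on pre-treatment outcome levels is a nearest-neighbor match. The sketch below assumes district-level panel data with hypothetical columns district, treated, year, and y:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def matched_control_districts(df: pd.DataFrame, treat_year: int,
                              n_neighbors: int = 1) -> pd.Series:
    """Select control districts whose pre-treatment outcome levels are
    closest to those of treated districts (simple nearest-neighbor matching).

    Expects district-level panel data with columns "district", "treated",
    "year", and outcome "y"; treat_year marks when the program began.
    """
    pre = df[df["year"] < treat_year]
    means = pre.groupby(["district", "treated"])["y"].mean().reset_index()
    treated = means[means["treated"] == 1]
    controls = means[means["treated"] == 0]

    # Find, for each treated district, the control district(s) with the
    # closest pre-treatment mean outcome, and return the matched set.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(controls[["y"]])
    _, idx = nn.kneighbors(treated[["y"]])
    return controls.iloc[np.unique(idx.ravel())]["district"]
```

Matching on a single pre-treatment level is only one of many possible choices; matching on multiple characteristics or on pre-period trends runs straight into the "which characteristics matter" question noted above.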
Pre-trend testing provides essential but imperfect evidence. Event study specifications—plotting treatment effects separately for each time period relative to intervention—visualize whether outcomes diverged before treatment. Formal tests examine whether pre-treatment coefficients differ significantly from zero, individually or jointly. Yet such tests have limited power, especially with few pre-treatment periods. A non-significant pre-trend test does not prove parallel trends; it merely fails to reject them.
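A sketch of such an event-study specification, assuming a hypothetical layout with columns unit, year, adopt_year (missing for never-treated units), and y:

```python
import pandas as pd
import statsmodels.formula.api as smf

def event_study(df: pd.DataFrame, window: int = 4):
    """Event-study DiD with leads and lags of treatment.

    Expects long-format data with columns "unit", "year", "adopt_year"
    (NaN for never-treated units), and outcome "y". The period just
    before adoption (relative time -1) is the omitted reference.
    """
    df = df.copy()
    df["rel_time"] = (df["year"] - df["adopt_year"]).clip(-window, window)

    # Build relative-time dummies by hand; never-treated units (NaN rel_time)
    # get zeros everywhere and so contribute only to the counterfactual trend.
    terms = []
    for k in range(-window, window + 1):
        if k == -1:          # omitted reference period
            continue
        name = f"lead{-k}" if k < 0 else f"lag{k}"
        df[name] = (df["rel_time"] == k).astype(int)
        terms.append(name)

    model = smf.ols(
        "y ~ " + " + ".join(terms) + " + C(unit) + C(year)",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["unit"]})

    # Lead coefficients (pre-treatment) should be near zero if parallel
    # trends is plausible; plot them with confidence intervals rather than
    # relying only on a joint significance test.
    return model
```

Note that with staggered adoption and heterogeneous effects, even these event-study coefficients can be contaminated by effects from other periods, which is another reason to pair this specification with the heterogeneity-robust estimators discussed above.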
Robustness checks probe whether conclusions depend on specific choices. Varying the control group, adding or removing covariates, changing the estimation window, and testing alternative outcome transformations reveal estimate stability. Placebo tests using outcomes that shouldn't respond to treatment help rule out spurious correlations. For staggered designs, comparing TWFE to heterogeneity-robust estimators is now essential.
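A placebo along these lines can reuse the event_study sketch above: shift adoption dates into the pre-treatment window, or swap in an outcome the program should not affect, and check that the estimated "effect" is near zero.

```python
def placebo_event_study(df: pd.DataFrame, shift: int = 2):
    """Fake-timing placebo: keep only years before any real adoption, shift
    adoption dates earlier by `shift` years, and re-run the event study.
    A clearly nonzero placebo "effect" points to differential pre-trends.
    The same pattern applies to placebo outcomes: re-estimate with an outcome
    column the program should not affect and expect estimates near zero.
    """
    first_adoption = df["adopt_year"].min()         # earliest real adoption year
    pre = df[df["year"] < first_adoption].copy()    # pre-treatment years only
    pre["adopt_year"] = pre["adopt_year"] - shift   # pretend adoption came earlier
    return event_study(pre)
```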
Documentation matters as much as estimation. Credible DiD studies explicitly defend the parallel trends assumption with institutional knowledge, present pre-trend evidence graphically and statistically, and demonstrate robustness across reasonable specification choices. Readers should be able to assess why, in this specific context, the comparison plausibly identifies causal effects rather than confounded correlations.
Takeaway: Credible DiD requires explicit defense of parallel trends, transparent pre-trend visualization, and demonstration that findings survive reasonable alternative specifications.
Difference-in-differences remains indispensable for development evaluation when experimental variation is unavailable. Its logic is intuitive and its data requirements modest compared to other quasi-experimental methods. But accessibility shouldn't breed complacency—credible implementation demands rigorous attention to assumptions that are easily violated and often difficult to assess.
The methodological advances of recent years have raised the bar for DiD practice. Staggered adoption requires modern estimators, not mechanical TWFE. Parallel trends demands institutional justification, not just statistical tests. Robustness requires systematic exploration, not perfunctory sensitivity analysis.
For development practitioners interpreting research or designing evaluations, these standards matter. Policy decisions built on flawed causal inference waste resources and, worse, may harm the populations we aim to serve. DiD offers a path from observational data to causal knowledge—but only when we respect the assumptions that make that path possible.