Development evaluation has an uncomfortable secret: a substantial portion of impact studies are fundamentally incapable of detecting the very effects they seek to measure. Organizations invest millions in randomized controlled trials, wait years for results, and then discover their study was underpowered—statistically unable to distinguish genuine program impacts from random variation. The null findings that emerge aren't evidence of program failure; they're evidence of evaluation failure.
This problem is neither exotic nor rare. A systematic review of published development RCTs found that roughly half had inadequate statistical power to detect meaningful effects. The consequences extend far beyond wasted budgets. Underpowered studies generate false negatives that kill promising interventions. They produce imprecise estimates that provide no actionable guidance. Worse, when underpowered studies occasionally find statistically significant effects, those findings are disproportionately likely to be exaggerated or spurious—a phenomenon that corrupts the evidence base systematically.
Power analysis is the mathematical discipline that prevents this waste. It forces evaluators to confront uncomfortable questions before spending a single dollar on data collection: What effect size can we realistically detect? What does that imply about our sample requirements? And if those requirements exceed our constraints, should we proceed at all? Mastering this analysis separates rigorous evaluators from those who generate noise and call it evidence.
Minimum Detectable Effects: The Foundation of Credible Design
Every evaluation can detect only effects above a certain threshold. This minimum detectable effect (MDE) is determined by four factors: your sample size, the underlying variance in your outcome, your tolerance for false positives (Type I error), and your tolerance for false negatives (Type II error). The relationship is mathematically precise, and ignoring it doesn't make it disappear—it simply guarantees you'll discover your limitations in the worst possible way, after the study concludes.
The standard formula for a two-sample comparison reveals the mechanics clearly. For 80% power and 5% significance, the required sample per treatment arm is approximately n ≈ 16(σ/MDE)², where σ is the outcome's standard deviation. If your cash transfer program aims to increase household consumption by $100 annually, and consumption has a standard deviation of $500, you need roughly 400 households per arm—800 total—just to reliably detect that effect.
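A few lines of code make the arithmetic concrete. The sketch below assumes a simple two-sample comparison of means with equal arms and uses the normal-approximation formula; the function name and the scipy dependency are illustrative choices, not a prescribed tool.

```python
from scipy.stats import norm

def sample_size_per_arm(sd, mde, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sample comparison of means:
    n ~ 2 * (z_{1-alpha/2} + z_{power})^2 * (sd / mde)^2,
    which is roughly 16 * (sd / mde)^2 at alpha = 0.05 and power = 0.80."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided 5% test
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 * (sd / mde) ** 2

# Cash transfer example from the text: $100 effect against a $500 SD of consumption.
print(round(sample_size_per_arm(sd=500, mde=100)))  # ~392 per arm; the 16*(sd/mde)^2 rule of thumb gives 400
```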
The critical insight is that MDE should drive design, not emerge from it. Too many evaluators begin with their available budget, calculate the sample it affords, and only then compute what effect they can detect. They discover their MDE is 0.4 standard deviations when similar programs have shown effects of 0.15 standard deviations. At that point, the evaluation is already compromised. Rigorous practice inverts this sequence. You begin by determining what effect size would justify the program's costs, then calculate required samples, then assess whether those requirements are feasible.
This approach demands honest confrontation with program theory. If a microfinance intervention costs $200 per beneficiary and you need 15% income gains to justify that cost, your MDE must be substantially smaller than 15%—otherwise you cannot distinguish 'program works' from 'program doesn't work' with any confidence. Effect sizes from the literature, pilot studies, or theoretical models should anchor your calculations. Optimistic assumptions about detectable effects produce studies that detect nothing useful.
The statistical reality is unforgiving: halving your MDE requires quadrupling your sample size. This nonlinear relationship means that detecting small but policy-relevant effects often requires samples an order of magnitude larger than initial intuition suggests. A school feeding program might improve test scores by 0.1 standard deviations—a meaningful effect aggregated across millions of children—but detecting it reliably requires thousands of students, not hundreds.
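Inverting the same formula shows the nonlinearity directly. The sketch below, under the same two-sample assumptions, reports the minimum detectable effect in standard-deviation units for successively quadrupled samples.

```python
from scipy.stats import norm

def mde_in_sd_units(n_per_arm, alpha=0.05, power=0.80):
    """Minimum detectable effect, in standard deviations, for a two-sample design."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * (2 / n_per_arm) ** 0.5

for n in (100, 400, 1600, 6400):
    print(n, round(mde_in_sd_units(n), 3))
# 100 -> ~0.40, 400 -> ~0.20, 1600 -> ~0.10, 6400 -> ~0.05:
# each halving of the MDE demands a quadrupling of the sample.
```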
Takeaway: Before designing any evaluation, calculate the minimum effect size your study can detect, then ask whether effects below that threshold would still be policy-relevant. If yes, your study cannot answer the question you're asking.
Cluster Randomization Complications: When Units Aren't Independent
Development interventions rarely randomize individuals. Programs operate through schools, villages, health clinics, or administrative units, and contamination concerns often require randomizing at these cluster levels. This seemingly practical choice carries profound statistical consequences that routinely catch evaluators unprepared. Cluster randomization can increase required sample sizes by factors of five, ten, or more—transforming feasible evaluations into impossibly expensive ones.
The mechanism is the intra-cluster correlation coefficient (ICC), which measures how much more similar individuals within the same cluster are to one another than to individuals in other clusters. In a village-randomized agricultural program, farmers in the same village share soil quality, weather patterns, market access, and social networks. Their outcomes are correlated even before any intervention. This correlation means that 50 farmers in one village provide far less statistical information than 50 farmers spread across 50 villages.
The design effect formula quantifies this penalty: DE = 1 + (m-1)ρ, where m is cluster size and ρ is the ICC. With 50 farmers per village and an ICC of 0.05—modest by development standards—your effective sample size is reduced by a factor of 3.5. That 800-person sample providing 80% power under individual randomization now requires nearly 3,000 individuals to achieve the same power under cluster randomization. ICCs of 0.10 or 0.15, common for economic outcomes in developing contexts, make the penalty even more severe.
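The penalty is easy to compute once you have a plausible ICC. The minimal sketch below reproduces the village example from the paragraph above; the function names are illustrative.

```python
def design_effect(m, icc):
    """Design effect for cluster randomization: DE = 1 + (m - 1) * icc."""
    return 1 + (m - 1) * icc

def cluster_adjusted_total(n_individual, m, icc):
    """Total sample needed under cluster randomization to match the power of an
    individually randomized design that needed n_individual subjects."""
    return n_individual * design_effect(m, icc)

# 50 farmers per village with an ICC of 0.05, as in the text.
print(round(design_effect(50, 0.05), 2))              # 3.45
print(round(cluster_adjusted_total(800, 50, 0.05)))   # 2760: the 800-person design now needs ~2,760 people
```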
The strategic implications reshape evaluation design fundamentally. Power increases faster by adding clusters than by adding individuals within clusters. Twenty villages of 30 households each provide more statistical power than 10 villages of 60 households, even though the total sample is identical. This insight should drive sampling strategies, budget allocation, and geographic scope decisions. Yet evaluators routinely oversample within clusters because per-household costs decline with cluster concentration, not recognizing they're purchasing statistical inefficiency.
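The same arithmetic shows why adding clusters beats adding households within them. The sketch below compares the two allocations just described at an assumed ICC of 0.05.

```python
def effective_sample_size(n_clusters, m, icc):
    """Sample size after discounting by the design effect 1 + (m - 1) * icc."""
    return n_clusters * m / (1 + (m - 1) * icc)

icc = 0.05
print(round(effective_sample_size(20, 30, icc)))  # 20 villages x 30 households -> ~245 effective
print(round(effective_sample_size(10, 60, icc)))  # 10 villages x 60 households -> ~152 effective
# Same 600 households, but spreading them across more villages buys far more power.
```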
Obtaining reliable ICC estimates before finalizing design is therefore essential. Baseline data, previous studies in similar contexts, or administrative records can provide preliminary estimates. Conservative assumptions—using higher plausible ICC values—protect against underpowering. Some evaluators deliberately collect pilot data specifically to estimate ICCs, recognizing that this small upfront investment prevents catastrophic design failures downstream.
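If baseline or pilot data are available, a one-way ANOVA estimator is a common way to back out the ICC. The sketch below is one such implementation under simplified assumptions (a continuous outcome, no covariates); the simulated pilot data are purely hypothetical.

```python
import numpy as np

def icc_anova(values, cluster_ids):
    """ANOVA estimator of the intra-cluster correlation:
    ICC = (MSB - MSW) / (MSB + (m0 - 1) * MSW), where m0 is the effective
    cluster size; negative estimates are truncated at zero."""
    values, cluster_ids = np.asarray(values, float), np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids)
    k, n = len(clusters), len(values)
    sizes = np.array([np.sum(cluster_ids == c) for c in clusters])
    means = np.array([values[cluster_ids == c].mean() for c in clusters])
    msb = np.sum(sizes * (means - values.mean()) ** 2) / (k - 1)
    msw = sum(np.sum((values[cluster_ids == c] - mu) ** 2)
              for c, mu in zip(clusters, means)) / (n - k)
    m0 = (n - np.sum(sizes ** 2) / n) / (k - 1)  # equals m when clusters are equal-sized
    return max(0.0, (msb - msw) / (msb + (m0 - 1) * msw))

# Hypothetical pilot: 8 villages of 25 households, with village-level effects implying a true ICC near 0.05.
rng = np.random.default_rng(0)
villages = np.repeat(np.arange(8), 25)
consumption = rng.normal(1000, 500, 200) + np.repeat(rng.normal(0, 120, 8), 25)
print(round(icc_anova(consumption, villages), 3))  # pilot-based estimate; with only 8 clusters it will be noisy
```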
Takeaway: When randomizing at cluster level, your effective sample size may be one-third to one-fifth of your actual sample size. Always calculate design effects using context-appropriate intra-cluster correlations before finalizing budgets.
Power Under Constraints: Strategic Optimization When Resources Bind
Real evaluations operate under binding constraints. Budgets are fixed, geographic scope is limited by partner capacity, political considerations restrict treatment assignment, and timelines impose data collection windows. The question is rarely whether constraints exist but how to maximize statistical power within them. This optimization problem has concrete solutions that can salvage evaluations that initially appear infeasible.
Stratified randomization offers consistent power gains with minimal cost. By ensuring treatment and control groups are balanced on predictors of the outcome—baseline values, geographic strata, demographic characteristics—you reduce residual variance and increase effective power. The gains are not dramatic, typically 10-20%, but they're essentially free and compound with other strategies. Any evaluation not stratifying on key observables is leaving power on the table.
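In practice this just means randomizing within strata rather than across the whole sample. The sketch below, assuming a pandas DataFrame with a region identifier and a baseline outcome, is one illustrative way to do it; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def stratified_assignment(df, strata_cols, seed=0):
    """Assign treatment within each stratum so arms stay balanced on the
    stratifying variables (here: region and baseline quartile)."""
    rng = np.random.default_rng(seed)
    treat = pd.Series(0, index=df.index)
    for _, idx in df.groupby(strata_cols).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        treat.loc[shuffled[: len(shuffled) // 2]] = 1  # half treated, the rest stay control
    return treat

# Hypothetical frame: two regions, a continuous baseline outcome cut into quartiles.
df = pd.DataFrame({
    "region": np.repeat(["north", "south"], 200),
    "baseline": np.random.default_rng(1).normal(1000, 500, 400),
})
df["baseline_q"] = pd.qcut(df["baseline"], 4, labels=False)
df["treat"] = stratified_assignment(df, ["region", "baseline_q"])
```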
ANCOVA specifications that control for baseline outcomes provide similar benefits during analysis. If you measure household consumption before and after intervention, analyzing change scores or controlling for baseline consumption in your regression reduces outcome variance substantially. Pre-analysis plans should specify these approaches, ensuring power gains are realized and not lost to analytical choices made after observing results.
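A short simulation illustrates the gain; the data-generating process and variable names below are hypothetical, chosen only to show how controlling for the baseline shrinks the standard error on the treatment coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical households: endline consumption tracks baseline, plus a $100 treatment effect.
rng = np.random.default_rng(2)
n = 800
base = rng.normal(2000, 500, n)
treat = rng.integers(0, 2, n)
end = 0.8 * base + 100 * treat + rng.normal(0, 300, n)
df = pd.DataFrame({"consumption_base": base, "consumption_end": end, "treat": treat})

naive = smf.ols("consumption_end ~ treat", data=df).fit()
ancova = smf.ols("consumption_end ~ treat + consumption_base", data=df).fit()
print(naive.bse["treat"], ancova.bse["treat"])  # the baseline control cuts the standard error sharply
```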
When sample sizes are genuinely fixed and inadequate for detecting the primary outcome, consider intermediate outcomes with less variance or stronger expected effects. A job training program's impact on formal employment might require 5,000 participants to detect, but impacts on job search behavior, interview callbacks, or skills assessments might be detectable with 1,000. These proximate outcomes sacrifice some policy relevance for statistical feasibility—a tradeoff worth making explicitly rather than discovering post-hoc that your primary outcome analysis is inconclusive.
The most important constraint-optimization strategy is knowing when not to proceed. If power calculations reveal that your evaluation cannot detect effects smaller than implausibly large magnitudes, the rigorous response is to redesign, rescope, or decline the evaluation entirely. Publishing underpowered studies imposes negative externalities on the entire evidence base. The intellectual honesty to abandon infeasible evaluations—or to clearly label them as exploratory and underpowered—is itself a form of methodological rigor that distinguishes credible evaluators from those who generate noise regardless of statistical foundations.
Takeaway: When facing fixed constraints, exploit stratification, baseline controls, and intermediate outcomes to maximize power. But recognize that some evaluations should not proceed—an underpowered study wastes resources and pollutes the evidence base.
Power analysis is not a bureaucratic hurdle in the evaluation approval process—it is the mathematical foundation that determines whether your study can generate evidence or merely the appearance of evidence. The calculations themselves are straightforward; the discipline lies in conducting them honestly, early, and with willingness to accept uncomfortable conclusions about study feasibility.
The development sector's evidence base suffers from a systematic survivor bias: we see published studies, many underpowered, but not the thousands of hours and millions of dollars spent on evaluations that were doomed from the design stage. Every rigorous power calculation that prevents an infeasible study represents resources redirected toward evaluations that can actually inform policy.
Demand power calculations in every evaluation you commission, fund, or review. Scrutinize the assumed effect sizes, variance estimates, and ICC values underlying sample size justifications. The credibility of evidence-based development depends on evaluators who understand that statistical power isn't a technical footnote—it's the difference between knowledge and noise.