Every statistical test carries a small risk of being wrong—typically 5%. That seems manageable. But here's the trap: run twenty tests at that threshold, and the odds are better than even that you'll find something that looks real but isn't.
This isn't about sloppy analysis or poor methodology. It's pure probability working against you. The more questions you ask of your data, the more likely you'll stumble upon a convincing false positive. Understanding this trap is essential for anyone who works with data—or who needs to evaluate claims based on statistical evidence.
Probability Accumulation: Why More Tests Mean More False Positives
Think of each statistical test like rolling a 20-sided die, where a false positive is rolling a 1. If you're testing at the 5% significance level, you have a 95% chance of correctly concluding 'nothing here' when there's truly nothing. Good odds for a single roll.
But probability accumulates in counterintuitive ways. Two tests? Your chance of at least one false positive jumps to nearly 10%. Ten tests? About 40%. Twenty tests? Over 64%. By the time you've run 100 tests on random noise, you're virtually certain to find 'significant' results—even though there's nothing real to discover.
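The arithmetic behind those figures is a single formula: with independent tests each run at significance level alpha, the chance of at least one false positive across n tests is 1 - (1 - alpha)^n. A short Python sketch (assuming independence, as above) reproduces the numbers:

```python
# Chance of at least one false positive across n independent tests,
# each run at the 5% significance level: 1 - (1 - alpha)^n
alpha = 0.05
for n in (1, 2, 10, 20, 100):
    p_any = 1 - (1 - alpha) ** n
    print(f"{n:>3} tests -> {p_any:.1%} chance of at least one false positive")
```

Running it prints roughly 5%, 10%, 40%, 64%, and 99%: the curve climbs quickly and never comes back down.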
This isn't a flaw in statistics; it's how probability works. Each individual test behaves exactly as advertised, but the cumulative risk grows with every question you ask. Researchers testing hundreds of genetic markers, traders backtesting dozens of strategies, or analysts checking multiple customer segments all face this same mathematical reality. The care you take on each individual test doesn't protect you from the aggregate risk.
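You can watch this happen in a simulation: draw two groups from the same distribution, so there is truly nothing to find, and test them over and over. The sketch below (sample sizes and seed are arbitrary choices) runs standard t-tests on pure noise:

```python
import numpy as np
from scipy import stats

# 100 t-tests on pure noise: both groups come from the same distribution,
# so every 'significant' result is a false positive by construction.
rng = np.random.default_rng(seed=1)
false_positives = 0
for _ in range(100):
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)  # no real difference
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' results out of 100 tests on noise: {false_positives}")
```

Expect around five spurious hits on a typical run, and almost never zero.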
Takeaway: False discoveries aren't mistakes—they're the predictable mathematical consequence of asking many questions. At a 5% false positive rate, twenty tests give you better-than-even odds of at least one spurious finding.
Family-Wise Error Rate: Controlling the Overall False Positive Probability
The solution starts with a mental shift: stop thinking about individual tests and start thinking about families of tests. The family-wise error rate asks: what's the probability that any test in my entire analysis produces a false positive?
The simplest correction is the Bonferroni method. If you want a 5% family-wise error rate across 20 tests, divide 0.05 by 20. Now each individual test must reach the 0.0025 significance level to count. This is deliberately conservative—it makes false positives rare but also makes genuine discoveries harder to detect.
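As a minimal sketch, here is the Bonferroni rule applied to a hypothetical set of 20 p-values (the values themselves are made up for illustration):

```python
# Bonferroni: divide the desired family-wise error rate by the number of tests.
alpha_family = 0.05
p_values = [0.001, 0.004, 0.019, 0.030, 0.041] + [0.20] * 15  # illustrative values

threshold = alpha_family / len(p_values)  # 0.05 / 20 = 0.0025
survivors = [p for p in p_values if p <= threshold]

print(f"Per-test threshold: {threshold:.4f}")              # 0.0025
print(f"Findings surviving Bonferroni: {len(survivors)}")  # only p = 0.001 survives
```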
More sophisticated approaches like the Benjamini-Hochberg procedure control the false discovery rate instead—the expected proportion of false positives among your significant results. This trades some protection for better sensitivity. If you're exploring data and can tolerate some false leads, this gentler correction lets more genuine signals through while still preventing chaos.
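Here is a minimal sketch of the Benjamini-Hochberg step-up procedure on the same hypothetical p-values (in practice a library routine such as statsmodels' multipletests does this for you):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q.

    Sort the p-values ascending, find the largest rank k with
    p_(k) <= (k / m) * q, and reject everything up to that rank.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            cutoff = rank
    return sorted(order[:cutoff])

p_values = [0.001, 0.004, 0.019, 0.030, 0.041] + [0.20] * 15  # same illustrative values
print(benjamini_hochberg(p_values))  # rejects indices 0 and 1
```

On these values Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps two, which is the trade-off in miniature: a few more discoveries in exchange for tolerating a controlled fraction of false ones.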
Takeaway: Shift your thinking from individual test accuracy to overall analysis accuracy. The family-wise error rate captures what actually matters: how often your entire investigation leads you astray.
Choosing the Right Correction for Your Situation
Not all multiple testing situations demand the same solution. The key factors are your tolerance for false positives, your tolerance for missed discoveries, and whether your tests are related or independent.
Bonferroni works best when false positives are costly and you're running relatively few tests. Medical trials testing a drug against multiple outcomes often use this approach—wrongly approving a treatment has serious consequences. But apply Bonferroni to thousands of tests and you'll miss almost everything real.
Benjamini-Hochberg suits exploratory analysis where you'll follow up on findings. Genomics researchers screening thousands of genes accept that 5% of their 'hits' might be false because they'll verify the promising ones with additional experiments. For pre-registered confirmatory analysis—testing a specific hypothesis you committed to beforehand—you might not need correction at all, since you're genuinely running one test. The discipline of pre-registration protects against the hidden multiple testing of 'trying different analyses until something works.'
Takeaway: Match your correction method to your context. Conservative corrections suit high-stakes confirmatory work; gentler corrections serve exploratory investigation where follow-up verification is planned.
The multiple comparisons problem isn't a technical gotcha—it's a fundamental feature of how evidence works. Every additional question you ask of your data increases the chance of finding something that isn't there. Recognizing this changes how you evaluate both your own analyses and others' claims.
When someone reports a surprising finding, ask: how many things did they test? If they don't know or won't say, treat the result with appropriate skepticism. In your own work, decide upfront what you're testing and how you'll correct for multiplicity. The math will do its thing either way—better to be prepared.