How Statistical Significance Became Misunderstood

P-values answer a narrow technical question that differs fundamentally from what researchers actually want to know.

The current dominance of significance testing emerged from a confused fusion of incompatible statistical frameworks.

Publication norms, training traditions, and career incentives entrenched flawed practices across generations of researchers.

Alternative approaches like effect sizes, confidence intervals, and Bayesian methods offer improvements but carry their own trade-offs.

Reforming statistical practice requires restructuring institutions, not merely persuading individual scientists.

How did a single threshold—p less than 0.05—come to function as the gatekeeper of scientific truth across disciplines as varied as psychology, medicine, and economics? The question matters because the answer reveals something profound about how knowledge-producing communities can institutionalize confusion.

Statistical significance testing was never designed to do what researchers now ask of it. Yet for nearly a century, this modest computational procedure has been treated as a verdict on reality itself, a ritual performed at the altar of objectivity.

Understanding this drift from technique to dogma offers more than a methodological lesson. It illuminates how scientific communities, despite their commitment to rigor, can collectively misinterpret their own tools when institutional pressures align with cognitive convenience. The story of p-values is, ultimately, a case study in social epistemology.

What P-Values Mean

A p-value answers a remarkably narrow question: assuming the null hypothesis is true, how likely is it that we would observe data at least as extreme as what we actually obtained? It does not tell us the probability that the null hypothesis is true, nor does it tell us the probability that our findings are real.
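
To make the definition concrete, here is a minimal Python sketch of an exact two-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips, tested against a fair coin) and the function names are illustrative assumptions, not taken from any study. "At least as extreme" is read here as "at least as far from the expected count", which is one common convention among several.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """P(outcome at least as far from the null expectation as observed | null is true)."""
    expected = flips * p_null
    observed_distance = abs(heads - expected)
    return sum(
        binomial_pmf(k, flips, p_null)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )


# Hypothetical data: 60 heads in 100 flips, tested against a fair coin.
p = two_sided_p_value(60, 100)
print(f"p-value: {p:.4f}")
```

The number returned is a statement about the data under the assumption that the coin is fair. Nothing in the calculation uses, or produces, a probability that the coin actually is fair.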

This distinction matters enormously. Researchers want to know whether their hypothesis is correct, whether an effect exists, whether a treatment works. The p-value, however, addresses none of these questions directly. It is, as statistician Andrew Gelman has noted, an answer to a question almost no one is actually asking.
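
The gap between the p-value and the question researchers care about can be shown with simple arithmetic. The sketch below computes what fraction of significant results come from true null hypotheses, given three assumed inputs: the share of tested hypotheses that are actually true, the statistical power of the studies, and the significance threshold. All three values are hypothetical; in practice they vary by field and are rarely known.

```python
def false_discovery_share(prior_true: float, power: float, alpha: float = 0.05) -> float:
    """Share of 'significant' results that arise from true null hypotheses.

    prior_true: assumed fraction of tested hypotheses where a real effect exists
    power:      assumed probability of detecting a real effect when it exists
    alpha:      significance threshold (false-positive rate under the null)
    """
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


# Hypothetical field: 10% of tested effects are real, studies have 80% power.
share = false_discovery_share(prior_true=0.10, power=0.80)
print(f"Significant results that are false positives: {share:.0%}")
```

Under these assumed inputs, 36% of significant findings are false positives, even though every one of them passed the 0.05 threshold. Changing the assumed base rate changes the answer, which is exactly the information a p-value does not contain.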

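The confusion stems partly from the historical fusion of two incompatible frameworks. Ronald Fisher proposed significance testing as one tool among many for evaluating evidence, with the p-value serving as a graded measure of how strongly the data disagree with a null hypothesis. Jerzy Neyman and Egon Pearson developed hypothesis testing as a decision procedure: fix acceptable error rates in advance, then choose between a null and an explicit alternative. Textbooks merged these traditions into an incoherent hybrid, one that computes a Fisherian p-value but reads it against a fixed cutoff as a Neyman-Pearson accept-or-reject verdict, and that promises more than either framework can deliver.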

Compounding the problem, the 0.05 threshold was always arbitrary: Fisher himself presented it as a convenient convention, not a principled boundary. Yet this cutoff has come to function as a sharp line between knowledge and noise, between publishable truth and ignored possibility.

Takeaway
A measurement instrument calibrated to answer one question cannot be repurposed to answer another simply because we wish it could. The questions our methods can address may differ profoundly from the questions we most want to ask.

Institutional Entrenchment

If p-values are so widely misunderstood, why have they persisted? The answer lies not in their epistemic merits but in the institutional ecosystem that grew around them. Journals developed publication norms favoring statistically significant results, creating what researchers call the file drawer problem—null findings languish unpublished while significant results, even spurious ones, enter the literature.
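
The file drawer mechanism can be illustrated with a small simulation. The sketch below assumes a world in which the true effect is exactly zero, runs many hypothetical studies, and "publishes" only those with p < 0.05. The study design (unit-variance outcomes, 30 observations per study, a simple z-test) and the publication filter are simplifying assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean

STANDARD_NORMAL = NormalDist()


def simulate_study(true_effect: float, n: int, rng: random.Random) -> tuple[float, float]:
    """Return (estimated effect, two-sided p-value) for one study.

    Assumes unit-variance outcomes, so the sample mean has standard error 1/sqrt(n).
    """
    standard_error = 1 / n**0.5
    estimate = rng.gauss(true_effect, standard_error)
    z = estimate / standard_error
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return estimate, p_value


rng = random.Random(0)  # fixed seed so the run is repeatable
studies = [simulate_study(true_effect=0.0, n=30, rng=rng) for _ in range(10_000)]

# Hypothetical publication filter: only p < 0.05 reaches the literature.
published = [estimate for estimate, p_value in studies if p_value < 0.05]

print(f"Share of studies published: {len(published) / len(studies):.1%}")
print(f"Mean |effect| across all studies: {mean(abs(e) for e, _ in studies):.3f}")
print(f"Mean |effect| in published studies: {mean(abs(e) for e in published):.3f}")
```

Even with no real effect, roughly one study in twenty clears the filter, and every one of those reports an estimate well away from zero. A reader who sees only the published subset would find consistent evidence for an effect that does not exist.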

Career incentives reinforced this distortion. Tenure committees count publications, grant agencies fund promising preliminary results, and graduate programs train students in the methods that produced their advisors' successes. The system rewards those who play by its rules, however flawed those rules may be.

Training traditions perpetuate the confusion across generations. Statistics courses often present significance testing as the one method of scientific inference rather than as one approach among several. Students learn to perform the rituals before they understand the underlying philosophy, and many never revisit the foundations.

This is precisely the kind of collective lock-in that social epistemologists study. No individual researcher created the system, and most participants recognize its flaws, yet the equilibrium persists because deviating from convention carries professional costs. Reform requires coordinated change across journals, funders, and institutions—a slow and contested process.

Takeaway
Epistemic communities can sustain practices they collectively know to be flawed when individual incentives diverge from collective interests. Reform is rarely a matter of better arguments alone; it requires restructuring the institutions that make bad practices rational at the individual level.

Beyond Significance

Reform proposals abound, each with its own epistemological trade-offs. Effect sizes shift attention from whether an effect exists to how large it is—a more substantively meaningful question, particularly in applied fields where practical importance matters more than mere detectability.
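
The difference between detectability and magnitude is easy to see in a worked example. The sketch below computes Cohen's d (a standardized mean difference) alongside a large-sample two-sided p-value for the same hypothetical trial. The numbers (a 0.5-point difference on a scale with standard deviation 15, with 50,000 people per group) are invented for illustration, and the z-test is a normal approximation, not a recommendation for any particular analysis.

```python
from statistics import NormalDist


def cohens_d(mean_difference: float, pooled_sd: float) -> float:
    """Standardized mean difference between two groups."""
    return mean_difference / pooled_sd


def two_sample_z_p_value(mean_difference: float, pooled_sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means (large-sample normal approximation)."""
    standard_error = pooled_sd * (2 / n_per_group) ** 0.5
    z = mean_difference / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trial: 0.5-point difference, SD 15, 50,000 people per group.
d = cohens_d(0.5, 15)
p = two_sample_z_p_value(0.5, 15, 50_000)
print(f"Cohen's d = {d:.3f}, p = {p:.2g}")
```

With these inputs the p-value is far below 0.05 while the standardized effect is about 0.03, a difference most applied fields would regard as negligible. Rerunning the same function with 50 people per group gives a p-value well above 0.05 for the identical effect size, which shows that significance here tracks sample size rather than importance.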

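Confidence intervals offer richer information than binary significance verdicts, displaying a range of parameter values compatible with the data rather than collapsing evidence into a single threshold decision. They invite interpretive nuance, though they too are easily misread: a 95% interval is often taken to mean there is a 95% probability that the parameter lies inside it, when the 95% actually describes how often the procedure captures the true value across repeated studies.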
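
What a confidence level does describe is the long-run behavior of the interval-building procedure, and a simulation makes that visible. The sketch below repeatedly draws samples from a population with a known mean, builds an approximate 95% interval each time, and counts how often the interval contains the true value. The population (normal, mean 10, standard deviation 1), the sample size, and the use of the 1.96 normal multiplier are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def normal_ci_95(sample: list[float]) -> tuple[float, float]:
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    center = mean(sample)
    half_width = 1.96 * stdev(sample) / len(sample) ** 0.5
    return center - half_width, center + half_width


def coverage_rate(true_mean: float, n: int, repetitions: int, seed: int = 0) -> float:
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        low, high = normal_ci_95(sample)
        hits += low <= true_mean <= high
    return hits / repetitions


# Hypothetical setup: true mean 10, 50 observations per experiment, 2,000 experiments.
rate = coverage_rate(true_mean=10.0, n=50, repetitions=2_000)
print(f"Intervals containing the true mean: {rate:.1%}")
```

The coverage rate lands close to 95%, but that figure belongs to the collection of intervals. Any single interval either contains the true mean or it does not, and the data alone do not say which.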

Bayesian methods promise to address what researchers actually want to know: given the evidence, how should we update our beliefs about competing hypotheses? Yet Bayesian approaches require specifying prior probabilities, introducing subjectivity that some view as a feature and others as a bug. The choice of prior can substantially affect conclusions, especially with limited data.
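
Prior sensitivity can be shown with the simplest Bayesian model, a success rate with a Beta prior and binomial data. The sketch below compares a flat prior with a skeptical one centered on 50%, first with a small data set and then with a large one. The priors, labels, and data are hypothetical choices for illustration; only the Beta-binomial updating rule itself is standard.

```python
def posterior_mean(successes: int, trials: int, prior_a: float, prior_b: float) -> float:
    """Posterior mean of a success rate under a Beta(prior_a, prior_b) prior.

    With binomial data the posterior is Beta(prior_a + successes, prior_b + failures).
    """
    return (prior_a + successes) / (prior_a + prior_b + trials)


priors = {"flat Beta(1, 1)": (1, 1), "skeptical Beta(10, 10)": (10, 10)}

# Hypothetical data sets with the same 70% success rate but different sizes.
for successes, trials in [(7, 10), (700, 1000)]:
    for label, (a, b) in priors.items():
        estimate = posterior_mean(successes, trials, a, b)
        print(f"{successes}/{trials} with {label}: posterior mean = {estimate:.3f}")
```

With 10 observations the two priors give posterior means of about 0.667 and 0.567, a gap large enough to change a conclusion. With 1,000 observations they give about 0.700 and 0.696, and the choice of prior barely matters.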

None of these alternatives is a panacea. Each carries its own potential for misuse, its own learning curve, its own institutional barriers to adoption. The deeper lesson may be that no single statistical procedure can substitute for careful reasoning about evidence, context, and theory. Pluralism in methods may serve scientific communities better than the search for a new universal standard.

Takeaway
The desire for a single, mechanical procedure to certify knowledge reflects a longing that science cannot fulfill. Judgment, context, and theoretical understanding cannot be outsourced to formulas, however sophisticated.

The p-value saga reveals how epistemic communities can mistake their tools for the truths those tools approximate. What began as a useful heuristic ossified into a ritual, sustained not by its merits but by the institutions built around it.

Current reform efforts—preregistration, registered reports, expanded methodological pluralism—represent attempts to redesign the social infrastructure of knowledge production. Whether they succeed depends less on philosophical arguments than on whether journals, funders, and universities can align incentives accordingly.

Perhaps the most enduring lesson is humility about our methods. The next generation will likely view some of our current practices with the same puzzlement we now direct at p-value worship. Epistemic progress requires not just better techniques but better awareness of how we might be misusing them.