Statistical inference is not merely a technical enterprise. Beneath every p-value, every posterior distribution, and every likelihood ratio lies a philosophical commitment about what counts as evidence and how evidence should constrain belief. The century-long dispute between frequentists and Bayesians is, at its core, a disagreement in formal epistemology—a clash over the logic of induction itself.

This connection is rarely made explicit in statistics textbooks, which tend to present methods as tools rather than philosophical positions. Yet the founders of modern statistics—Fisher, Neyman, Pearson, Jeffreys, de Finetti—understood that they were building competing theories of evidential reasoning. Their debates were continuous with Carnap's inductive logic, Ramsey's subjective probability, and Popper's falsificationism. The philosophy was never separable from the mathematics.

Bridging these traditions reveals something remarkable: many of the most contentious practical controversies in statistics—whether to report p-values or Bayes factors, whether stopping rules matter, whether the likelihood principle is valid—reduce to precise formal questions about the structure of confirmation. Once we see statistical inference as applied inductive logic, these disputes become tractable. They acquire clear premises, identifiable assumptions, and, in some cases, definitive resolutions. What follows is an examination of three loci where the philosophy of confirmation and the practice of statistics meet most consequentially.

Confirmation and Significance: Two Languages for Evidence

Consider a simple evidential question: data E have been observed under hypothesis H. Does E support H? Bayesian confirmation theory and Fisherian significance testing offer structurally different answers. In confirmation theory, E confirms H just in case P(H | E) > P(H)—equivalently, when the likelihood ratio P(E | H) / P(E | ¬H) exceeds unity. This is a comparative measure: evidence is defined relationally, by how much more expected the data are under one hypothesis than its negation.
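
A minimal sketch makes the equivalence explicit. The prior and likelihood values below are illustrative assumptions, not drawn from any particular example above.

```python
# Bayesian confirmation: E confirms H exactly when P(H | E) > P(H),
# which holds exactly when the likelihood ratio P(E | H) / P(E | ~H) > 1.

def posterior(prior_h, lik_h, lik_not_h):
    """P(H | E) by Bayes' theorem over the partition {H, not-H}."""
    marginal = lik_h * prior_h + lik_not_h * (1 - prior_h)
    return lik_h * prior_h / marginal

prior_h = 0.3      # P(H), illustrative
lik_h = 0.8        # P(E | H), illustrative
lik_not_h = 0.2    # P(E | ~H), illustrative

post = posterior(prior_h, lik_h, lik_not_h)
likelihood_ratio = lik_h / lik_not_h

print(f"P(H) = {prior_h}, P(H | E) = {post:.3f}, LR = {likelihood_ratio:.1f}")
# The two criteria agree: confirmation holds iff the likelihood ratio exceeds 1.
assert (post > prior_h) == (likelihood_ratio > 1)
```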

Fisherian significance testing proceeds differently. It asks: under the null hypothesis H₀, how probable is a result at least as extreme as the one observed? The p-value P(X ≥ x | H₀) measures not how well the data confirm an alternative, but how surprising they are under the null. This is a single-hypothesis measure—no alternative is formally specified. Fisher regarded small p-values as evidence against H₀, but the framework does not quantify evidence for anything.

The divergence has concrete consequences. A p-value of 0.04 might seem to favor the alternative hypothesis, but the corresponding Bayes factor can easily favor the null—a phenomenon Berger and Sellke (1987) demonstrated systematically for two-sided tests of a point null. Depending on the prior over alternatives and the sample size, p-values near 0.05 can correspond to Bayes factors of 3 or more in favor of the null. The two frameworks do not merely quantify evidence differently; they can issue contradictory verdicts.
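
To see how the contradiction can arise, here is a minimal sketch for the standard normal point-null setup: a sample mean from n unit-variance observations, H₀: μ = 0 versus H₁ under which μ has a N(0, τ²) prior. The prior scale τ and the sample sizes are illustrative assumptions, not Berger and Sellke's exact calculations.

```python
from math import erf, exp, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def bf_null_over_alt(z, n, tau=1.0):
    """Bayes factor BF01 = m(data | H0) / m(data | H1) for the normal
    point-null model with a N(0, tau^2) prior on mu under H1 (sigma = 1)."""
    r = n * tau ** 2                      # ratio of prior to sampling variance
    return sqrt(1 + r) * exp(-0.5 * z ** 2 * r / (1 + r))

z = 1.96                                  # two-sided p of about 0.05
print(f"p = {two_sided_p(z):.3f}")
for n in (50, 200, 1000):
    print(f"n = {n:4d}: BF01 = {bf_null_over_alt(z, n):.2f}")
# BF01 > 1 means the same data that give p ~ 0.05 favor the null over this
# alternative, and the effect grows with n (the Jeffreys-Lindley phenomenon).
```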

The root of the disagreement traces to the relevance criterion in confirmation theory. Bayesian confirmation is inherently contrastive—evidence must discriminate between hypotheses. Significance testing violates this by evaluating data against a single hypothesis without requiring a specified alternative. From the standpoint of inductive logic, this renders the p-value an incomplete evidential measure. It answers the question "Are the data surprising under H₀?" but not the epistemologically prior question "Do the data discriminate between H₀ and any particular H₁?"

This does not make significance testing useless—Fisher never intended it as a mechanical decision procedure—but it clarifies its epistemic limitations. The p-value functions as a measure of misfit, not a measure of confirmation. Conflating the two, as much applied statistics does, commits a category error that formal epistemology is uniquely positioned to diagnose. Keeping the distinction in view dissolves a number of apparent paradoxes in statistical evidence and bears directly on the widely discussed "replication crisis" in the social sciences.

Takeaway

Evidence against a hypothesis is not the same as evidence for any particular alternative. Any framework that evaluates data under only one hypothesis provides an incomplete account of evidential support, no matter how precise its calculations.

The Likelihood Principle: A Constraint on Evidential Meaning

The likelihood principle states that all the evidence about an unknown parameter θ contained in data x from experiment E is captured by the likelihood function L(θ) = P(x | θ, E). Formally: if two experiments E₁ and E₂ yield observations x₁ and x₂ with proportional likelihood functions—L₁(θ) ∝ L₂(θ) for all θ—then x₁ and x₂ carry identical evidence about θ. The principle is a direct consequence of Bayesian updating: since the posterior is proportional to the prior times the likelihood, any two datasets generating the same likelihood function produce the same posterior, regardless of the experimental design that generated them.
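
The cancellation is easy to verify directly. The grid, prior, and proportionality constant below are illustrative assumptions; the point is only that the constant drops out in normalization.

```python
# Two proportional likelihood functions, L2 = c * L1, yield the same posterior
# under any prior, because the constant c cancels when the posterior is normalized.

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

theta_grid = [i / 20 for i in range(1, 20)]        # grid over (0, 1)
prior = [1 / len(theta_grid)] * len(theta_grid)    # uniform prior, illustrative

lik_1 = [t ** 3 * (1 - t) ** 9 for t in theta_grid]   # e.g. kernel for 3 successes, 9 failures
lik_2 = [55 * value for value in lik_1]                # proportional likelihood, c = 55

posterior_1 = normalize([p * l for p, l in zip(prior, lik_1)])
posterior_2 = normalize([p * l for p, l in zip(prior, lik_2)])
assert all(abs(a - b) < 1e-12 for a, b in zip(posterior_1, posterior_2))
```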

Birnbaum's (1962) celebrated theorem gave the principle a foundation independent of Bayesianism, deriving it from two premises most statisticians accept independently: the sufficiency principle (sufficient statistics capture all relevant information) and the conditionality principle (inference should be conditional on the experiment actually performed, when the design involves a random choice among experiments). If you accept both, the likelihood principle follows deductively. This placed frequentist statistics in a philosophical bind, since standard frequentist methods routinely violate the likelihood principle.

The violation is not subtle. Consider a classic example: a researcher observes 3 successes in 12 Bernoulli trials. Under the binomial model (fixed n = 12), the one-sided p-value for testing θ = 0.5 is approximately 0.073. Under the negative binomial model (sampling until 3 successes), the same data yield a p-value of approximately 0.033. The likelihood functions are proportional—identical evidence about θ by the likelihood principle—yet the frequentist verdicts differ. One crosses the conventional 0.05 threshold; the other does not.
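
A short sketch reproduces both numbers; the arithmetic below assumes the one-sided p-values that the quoted figures imply.

```python
from math import comb

# Binomial model, fixed n = 12: the p-value is P(X <= 3 | n = 12, theta = 0.5),
# since "at least as extreme" means 3 or fewer successes.
p_binomial = sum(comb(12, k) for k in range(4)) / 2 ** 12

# Negative binomial model, sample until 3 successes: the p-value is
# P(N >= 12 trials), i.e. P(at most 2 successes in the first 11 trials).
p_neg_binomial = sum(comb(11, k) for k in range(3)) / 2 ** 11

print(f"binomial p          = {p_binomial:.3f}")      # ~0.073
print(f"negative binomial p = {p_neg_binomial:.3f}")  # ~0.033

# Yet the likelihoods differ only by a constant factor, so they are proportional:
#   binomial:          C(12, 3) * theta^3 * (1 - theta)^9
#   negative binomial: C(11, 2) * theta^3 * (1 - theta)^9
```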

Frequentist responses to this challenge generally target Birnbaum's conditionality principle or dispute the scope of his theorem. Mayo (2014) and others have argued that the derivation equivocates between different senses of evidential equivalence. Evans (2013) offered a formal counterargument questioning the theorem's logical validity. These critiques have merit as logical analysis, but they have not produced a consensus frequentist alternative to the likelihood principle that preserves intuitive constraints on evidence.

From the standpoint of inductive logic, the likelihood principle functions as a coherence constraint—analogous to the probability axioms themselves. It says that the evidential import of data should depend on what was observed and how the observations relate to hypotheses, not on what the experimental protocol would have done with outcomes that never occurred. Accepting this principle reorganizes the foundations of statistics around the likelihood function as the primitive carrier of evidence, with Bayesian inference emerging as its natural extension and frequentist methods requiring auxiliary justification wherever they depart from it.

Takeaway

If two datasets tell exactly the same story about a parameter—identical likelihood functions—then any coherent account of evidence must treat them as equivalent, regardless of the sampling plan that produced them. The design you didn't execute shouldn't change what the data you did collect mean.

Stopping Rules and the Geometry of Evidence

No controversy illuminates the frequentist-Bayesian divide more sharply than the problem of stopping rules. Suppose a researcher plans to flip a coin 100 times but, after observing 60 heads in 80 flips, stops early because the result already seems decisive. For a Bayesian, this is entirely unproblematic: the posterior depends only on the observed data through the likelihood function, and the likelihood function is the same whether the researcher planned 80 flips, 100 flips, or decided to stop at 60 heads. The stopping rule is evidentially irrelevant.
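
A sketch of the Bayesian update makes the irrelevance visible. The Beta(1, 1) prior is an illustrative assumption; any prior would do, since the stopping rule never enters the calculation.

```python
# Posterior for the coin's bias theta after 60 heads in 80 flips, under a
# conjugate Beta(a, b) prior. Only the observed counts enter the update.
heads, tails = 60, 20
a, b = 1, 1                              # uniform Beta(1, 1) prior, illustrative

a_post, b_post = a + heads, b + tails    # conjugate update: Beta(61, 21)
posterior_mean = a_post / (a_post + b_post)
print(f"posterior = Beta({a_post}, {b_post}), mean = {posterior_mean:.3f}")

# The likelihood is proportional to theta^60 * (1 - theta)^20 whether the plan
# was 80 flips, 100 flips, or "stop once the result looks decisive", so the
# posterior is the same in every case.
```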

For a frequentist, the stopping rule is constitutive of the inference. The p-value is defined as the probability of observing a result at least as extreme as the one obtained, and "at least as extreme" depends on the sample space—the set of outcomes that could have occurred. Changing the stopping rule changes the sample space, which changes the p-value, which changes the inference. A fixed-sample design and a sequential design with optional stopping can yield the same data but different p-values. This is not a technical wrinkle; it is a fundamental feature of the frequentist framework.

The epistemological stakes are profound. The frequentist position implies that the intentions of the experimenter—specifically, what they planned to do with data they never collected—are part of the evidence. Armitage, McPherson, and Rowe (1969) demonstrated that optional stopping with repeated significance tests at the 5% level produces a type I error rate far exceeding 0.05. This is a genuine problem for error control. But from the standpoint of confirmation theory, it reveals a tension: the data themselves are identical, yet the evidential assessment changes based on the researcher's mental state at the time of data collection.
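
The inflation is easy to exhibit by simulation. The peeking schedule, maximum sample size, and use of a normal-approximation test below are illustrative assumptions rather than Armitage, McPherson, and Rowe's exact design.

```python
import random
from math import sqrt

def rejects_under_optional_stopping(max_n=500, min_n=20, z_crit=1.96):
    """Flip a fair coin, test H0: theta = 0.5 at the nominal 5% level after
    every flip from min_n onward, and report whether any peek 'rejects'."""
    heads = 0
    for n in range(1, max_n + 1):
        heads += random.random() < 0.5
        if n >= min_n:
            z = (heads - n / 2) / sqrt(n / 4)   # normal-approximation z statistic
            if abs(z) > z_crit:
                return True
    return False

random.seed(0)
trials = 2000
rate = sum(rejects_under_optional_stopping() for _ in range(trials)) / trials
print(f"type I error under optional stopping: {rate:.2f}")  # well above the nominal 0.05
```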

Bayesians argue that this dependence on intentions is epistemically incoherent. If you hand identical datasets to two analysts—one who planned fixed-sample testing and one who planned sequential testing—they will reach different frequentist conclusions. The data are the same; only the counterfactual experimental designs differ. The Bayesian position, grounded in the likelihood principle, holds that evidence is a property of what was observed, not what might have been. This does not mean Bayesians ignore the practical concern of error accumulation—calibration can be achieved through prior specification and decision-theoretic frameworks—but the evidential analysis remains invariant to the stopping rule.

The formal resolution is clarifying. The frequentist's sample space is a feature of the operating characteristics of a procedure—its long-run error rates—not of the evidence in a particular dataset. Conflating these two objects is, from the perspective of inductive logic, a scope error. Operating characteristics evaluate methods; likelihoods evaluate evidence. Both are legitimate objects of analysis, but they answer different questions. Recognizing this distinction does not require abandoning frequentist tools, but it does require acknowledging that the frequentist framework, by design, makes claims about procedures rather than about the evidential meaning of particular observations.

Takeaway

The stopping rule problem forces a choice: either evidence depends on what the experimenter intended but never did, or evidence is determined solely by the data actually observed. Inductive logic favors the latter, placing the burden on frequentist methods to justify their dependence on unobserved counterfactuals.

The frequentist-Bayesian controversy is not a mere turf war between competing software packages. It is a philosophical disagreement about the structure of inductive reasoning, and formal epistemology provides the language to make its premises and consequences precise.

Each of the three focal points examined here—confirmation versus significance, the likelihood principle, and stopping rules—reveals the same underlying tension: whether evidence is a property of the data actually observed or of the procedure that generated them. Bayesian confirmation theory, the likelihood principle, and stopping-rule invariance form a mutually reinforcing triad. Accepting any one creates strong pressure toward the others.

This does not settle every question—prior specification, computational tractability, and calibration remain live issues. But it does clarify what is at stake. The philosophy of statistics is not an optional addendum to statistical practice. It is the practice, made explicit.