When should you trust an expert's probability judgments? The question sounds straightforward, but it conceals a deep formal tension that epistemologists have only recently begun to resolve. Two distinct virtues govern probabilistic reasoning: calibration, the alignment of stated probabilities with observed frequencies, and coherence, the internal consistency of probability assignments under the axioms of probability theory. Both are desirable. Neither alone is sufficient.
The philosophical tradition has long privileged coherence. Dutch book arguments, dating back to Ramsey and de Finetti, demonstrate that incoherent agents face guaranteed losses. But coherence is a structural property — it tells you nothing about whether an agent's probabilities track reality. A perfectly coherent agent can be systematically wrong about everything. Calibration, by contrast, is an empirical property — it measures accuracy across repeated judgments. Yet calibration carries its own pathologies, ones that formal analysis makes precise.
This article develops the formal relationship between these two epistemic virtues and shows how their interaction generates principled criteria for expert deference. We begin with the mathematical definitions, proceed through independence results demonstrating that neither property entails the other, and conclude with a framework for evaluating expert probability judgments that integrates calibration, coherence, and domain-specific competence. The goal is a rigorous account of when probabilistic expertise deserves epistemic authority — and when it does not.
The Brier Score and Its Decomposition
Calibration admits a precise mathematical definition. An agent issuing probability forecasts is perfectly calibrated if, among all events assigned probability p, the long-run relative frequency of occurrence equals p. Formally, let an agent assign probabilities p₁, p₂, …, pₙ to events E₁, E₂, …, Eₙ, and let oᵢ ∈ {0, 1} denote whether Eᵢ occurred. Group the forecasts into bins B(p) = {i : pᵢ = p}. The agent is calibrated if, for each bin, the mean outcome converges to p as the bin grows large.
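The bin construction can be made concrete. Here is a minimal sketch in Python (the function name `calibration_table` is mine, not standard) that groups forecasts by their stated probability and reports each bin's observed frequency:

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts into bins B(p) = {i : p_i = p} and report each
    bin's mean outcome; a calibrated agent has mean close to p per bin."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[p].append(o)
    return {p: sum(v) / len(v) for p, v in sorted(bins.items())}

# Ten 0.7-forecasts with 7 hits, ten 0.3-forecasts with 3 hits:
probs = [0.7] * 10 + [0.3] * 10
outs  = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7
print(calibration_table(probs, outs))   # {0.3: 0.3, 0.7: 0.7}
```

On this toy record the agent is perfectly calibrated: each bin's frequency matches its label exactly.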
The standard scoring rule for evaluating probabilistic forecasts is the Brier score: BS = (1/n) Σ(pᵢ − oᵢ)². Lower scores indicate better performance. What makes the Brier score epistemically significant is its clean decomposition. Murphy (1973) showed that the Brier score decomposes as BS = CAL − REF + VAR, where CAL measures calibration error (Murphy's "reliability"), REF measures refinement (the ability to sort events into groups with different base rates — Murphy's "resolution"), and VAR is the irreducible variance of the outcome sequence (Murphy's "uncertainty").
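Murphy's identity can be verified numerically. The sketch below (all function and variable names are mine) computes each component over the same bins used for calibration and checks that BS = CAL − REF + VAR:

```python
def brier_decomposition(forecasts, outcomes):
    """Return (BS, CAL, REF, VAR) per Murphy (1973), so BS = CAL - REF + VAR."""
    n = len(forecasts)
    base = sum(outcomes) / n                     # overall base rate
    bins = {}
    for p, o in zip(forecasts, outcomes):
        bins.setdefault(p, []).append(o)
    # CAL: bin-weighted squared gap between stated probability and bin frequency
    cal = sum(len(v) * (p - sum(v) / len(v)) ** 2 for p, v in bins.items()) / n
    # REF: bin-weighted spread of bin frequencies around the base rate
    ref = sum(len(v) * (sum(v) / len(v) - base) ** 2 for v in bins.values()) / n
    var = base * (1 - base)                      # irreducible outcome variance
    bs = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return bs, cal, ref, var

probs = [0.8] * 5 + [0.2] * 5
outs  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
bs, cal, ref, var = brier_decomposition(probs, outs)
assert abs(bs - (cal - ref + var)) < 1e-12       # the identity holds
```

Here the agent is calibrated (CAL = 0: the 0.8-bin has an 80% hit rate, the 0.2-bin a 20% hit rate), so the entire Brier score of 0.16 comes from VAR = 0.25 minus REF = 0.09.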
This decomposition is philosophically illuminating. Calibration penalizes systematic bias — saying 70% when the frequency is 50%. Refinement rewards discrimination — the ability to assign different probabilities to genuinely different situations rather than issuing a uniform base-rate forecast for everything. An agent who always predicts the base rate is perfectly calibrated but maximally unrefined. An agent who sorts events into meaningfully different risk categories demonstrates genuine predictive competence.
The decomposition reveals that calibration and refinement are independent contributors to forecast quality. You can improve calibration without improving refinement, and vice versa. This independence is not merely technical — it reflects a deep epistemological distinction between accuracy of confidence and informativeness of judgment. A weather forecaster who says "30% chance of rain" every day in a climate where it rains 30% of the time is calibrated but useless. One who correctly varies predictions between 5% and 95% depending on atmospheric conditions provides genuine epistemic value.
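The weather-forecaster contrast can be computed directly. In this sketch (the `refinement` helper is mine), the constant base-rate forecaster is perfectly calibrated but scores zero refinement, while a discriminating forecaster captures the full outcome variance:

```python
def refinement(forecasts, outcomes):
    """REF term of the Murphy decomposition: bin-weighted spread of
    per-bin frequencies around the overall base rate."""
    n = len(forecasts)
    base = sum(outcomes) / n
    bins = {}
    for p, o in zip(forecasts, outcomes):
        bins.setdefault(p, []).append(o)
    return sum(len(v) * (sum(v) / len(v) - base) ** 2 for v in bins.values()) / n

outs     = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]         # it rains 30% of days
constant = [0.3] * 10                              # always the base rate
varied   = [0.9 if o else 0.05 for o in outs]      # discriminating forecasts

print(refinement(constant, outs))   # 0.0 -- calibrated but uninformative
print(refinement(varied, outs))     # ~0.21 -- maximal given these outcomes
```

The varied forecaster's refinement of 0.21 equals the outcome variance 0.3 × 0.7: on this toy record it sorts rainy from dry days perfectly, which is exactly what the constant forecaster cannot do.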
For formal epistemology, the Brier decomposition establishes that probabilistic expertise is not a single property but a composite of at least two measurable dimensions. Any criterion for expert deference that attends only to calibration or only to refinement is formally incomplete. The decomposition also connects to proper scoring rules more broadly — the Brier score is strictly proper, meaning an agent minimizes expected score only by reporting true beliefs. This incentive-compatibility property ensures that the score measures genuine epistemic performance rather than strategic behavior.
Takeaway: Probabilistic expertise has two independent measurable dimensions: calibration (are your confidence levels honest?) and refinement (can you distinguish genuinely different situations?). Neither alone captures what it means to know what you're talking about.
Independence of Calibration and Coherence
The central formal result connecting calibration and coherence is negative: neither property entails the other. This independence is provable and carries significant consequences for epistemological theory. Consider first calibration without coherence. Construct an agent who assigns P(A) = 0.5 and P(¬A) = 0.6, violating additivity: the probabilities sum to 1.1. On the single pair A, ¬A, calibration could not vindicate both forecasts — if A occurs half the time, ¬A occurs half the time, not 60%. But calibration is assessed over bins that pool forecasts across many propositions: ¬A sits in the 0.6-bin alongside every other proposition assigned 0.6, and that pooled bin can converge to a 60% hit rate even though ¬A itself occurs only 50% of the time. Incoherence resides in the joint structure of the assignments; calibration is assessed marginally, bin by bin.
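This marginal-versus-joint distinction can be simulated. In the toy setup below (my construction, not Dawid's), the agent reports P(A) = 0.5 and P(¬A) = 0.6 on every round — incoherent, since 0.5 + 0.6 = 1.1 — while the 0.6-bin also pools in an unrelated proposition B that is true 70% of the time:

```python
import random

random.seed(1)

# The 0.5-bin collects forecasts for A (true half the time).
# The 0.6-bin pools not-A (true half the time) with B (true 70% of
# the time), so the pooled bin's hit rate converges to (0.5 + 0.7) / 2.
bin_05, bin_06 = [], []
for _ in range(100_000):
    a = random.random() < 0.5
    b = random.random() < 0.7
    bin_05.append(a)          # forecast 0.5 for A
    bin_06.append(not a)      # forecast 0.6 for not-A (incoherent with the above)
    bin_06.append(b)          # forecast 0.6 for B

print(sum(bin_05) / len(bin_05))   # ~0.5: the 0.5-bin is calibrated
print(sum(bin_06) / len(bin_06))   # ~0.6: so is the 0.6-bin
```

Both bins hit their stated frequencies, so the agent passes every marginal calibration test while violating additivity on every single round.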
This observation generalizes. Dawid (1982) demonstrated that for any outcome sequence, there exist incoherent forecasting strategies that are perfectly calibrated. The construction exploits the fact that calibration is defined over equivalence classes of forecasts sharing the same probability value, without requiring those values to satisfy any relational constraints. An agent can be calibrated on each individual proposition while violating every structural axiom connecting those propositions.
The converse — coherence without calibration — is even more straightforward. Any coherent probability distribution that diverges from the true frequencies is a counterexample. A Bayesian agent with a badly chosen prior can satisfy every axiom of probability while assigning 0.9 to events that occur only 10% of the time. Coherence is a synchronic constraint on the structure of beliefs at a time; calibration is a diachronic constraint on the relationship between beliefs and outcomes over time. These are logically independent dimensions.
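The converse direction needs only a two-line check (toy numbers, a single 0.9-bin):

```python
# Coherent but badly calibrated: every forecast obeys the probability
# axioms, yet the 0.9-bin's observed hit rate is only 0.1.
forecasts = [0.9] * 10
outcomes  = [1] + [0] * 9                # the event occurs 10% of the time

bin_rate  = sum(outcomes) / len(outcomes)
cal_error = (0.9 - bin_rate) ** 2        # CAL contribution of this bin
print(bin_rate, cal_error)               # 0.1 and ~0.64
```

No axiom is violated anywhere, yet the calibration error of 0.64 is nearly the worst possible for a binary forecast.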
The philosophical significance is substantial. Dutch book arguments establish that coherence is necessary for a certain kind of pragmatic rationality — avoiding sure loss. But they say nothing about empirical adequacy. Conversely, calibration arguments establish a form of empirical reliability but permit structural irrationality. An epistemology built on Dutch books alone tolerates persistent empirical failure. An epistemology built on calibration alone tolerates logical incoherence. Neither is acceptable as a complete account of rational belief.
This independence result motivates a conjunctive requirement: rational probabilistic agents should be both calibrated and coherent. But the conjunction raises new questions: is it jointly achievable, and under what conditions? Oakes (1985) and subsequent work showed that no deterministic forecasting strategy can guarantee calibration on every outcome sequence; whether a coherent agent can achieve it depends on structural conditions related to the complexity of the sequence. In adversarial settings, then, a coherent agent may be unable to guarantee calibration — a result with deep connections to algorithmic randomness and computational learning theory.
Takeaway: Calibration and coherence are logically independent: you can have either without the other. This means no single epistemic virtue — whether Dutch-book consistency or frequency-matching — suffices for rational probabilistic belief. Both constraints are necessary; neither is redundant.
A Formal Framework for Expert Deference
Given the independence of calibration and coherence, what formal criteria should govern deference to expert probability judgments? The framework I propose has three components, each individually necessary and jointly sufficient: coherence, calibration, and domain refinement. The expert's probability function must satisfy the Kolmogorov axioms (coherence), must exhibit acceptable calibration error over a suitable reference class (calibration), and must demonstrate refinement superior to the non-expert's own forecasts within the relevant domain (domain refinement).
The third condition — domain refinement — is where genuine expertise lives. Following the Brier decomposition, we can quantify expertise as the difference in refinement components: REF_expert − REF_novice > 0. This formalizes the intuition that an expert doesn't merely avoid systematic bias; they know more in the sense of being able to make finer probabilistic discriminations. A medical diagnostician who distinguishes 5% risk patients from 40% risk patients provides more epistemic value than one who assigns the base rate of 20% to everyone, even if both are calibrated.
The framework resolves several persistent puzzles in the epistemology of testimony and expertise. First, it explains why track records matter: calibration and refinement are empirically measurable, providing objective grounds for expert evaluation rather than mere credentialism. Second, it explains why expertise is domain-specific: domain refinement is defined relative to a subject matter, so a physicist's probabilistic competence in particle physics confers no authority in epidemiology. Third, it handles disagreement among experts: when two coherent, calibrated experts disagree, refinement scores provide a principled basis for differential weighting.
Formally, let E be an expert and N a novice. Define the deference condition: N should adopt E's probability P_E(A) for proposition A if (i) P_E is coherent, (ii) E's calibration error over the relevant reference class falls below threshold ε, and (iii) E's expected refinement for A-type propositions exceeds N's. Condition (iii) can be operationalized using cross-validation on historical forecasts or, where historical data is unavailable, through structural arguments about E's access to relevant evidence and inferential methods.
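The three-part deference condition can be operationalized over finite forecast records. The sketch below is mine, not a standard implementation, and its coherence check is only a crude proxy (a range check on individual probabilities — full Kolmogorov coherence would require the joint assignments, which a flat forecast record omits):

```python
def _bins(forecasts, outcomes):
    bins = {}
    for p, o in zip(forecasts, outcomes):
        bins.setdefault(p, []).append(o)
    return bins

def calibration_error(forecasts, outcomes):
    """CAL term: bin-weighted squared gap between forecast and bin frequency."""
    n = len(forecasts)
    return sum(len(v) * (p - sum(v) / len(v)) ** 2
               for p, v in _bins(forecasts, outcomes).items()) / n

def refinement(forecasts, outcomes):
    """REF term: bin-weighted spread of bin frequencies around the base rate."""
    n = len(forecasts)
    base = sum(outcomes) / n
    return sum(len(v) * (sum(v) / len(v) - base) ** 2
               for v in _bins(forecasts, outcomes).values()) / n

def should_defer(exp_ps, exp_os, nov_ps, nov_os, eps=0.01):
    coherent   = all(0.0 <= p <= 1.0 for p in exp_ps)          # (i), proxy only
    calibrated = calibration_error(exp_ps, exp_os) < eps       # (ii)
    refined    = refinement(exp_ps, exp_os) > refinement(nov_ps, nov_os)  # (iii)
    return coherent and calibrated and refined

outs   = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
expert = [1.0 if o else 0.0 for o in outs]   # sharp and calibrated on this record
novice = [0.3] * len(outs)                   # base rate only: zero refinement
print(should_defer(expert, outs, novice, outs))   # True
```

Note that `should_defer(novice, outs, novice, outs)` is False: the novice fails condition (iii) against themselves, matching the observation in the text that deference without a refinement advantage gains nothing.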
This framework also identifies conditions under which deference is irrational. If an expert is coherent but poorly calibrated, their structural consistency masks empirical failure — the epistemological analogue of an internally consistent but empirically false scientific theory. If an expert is calibrated but incoherent, their individual forecasts may be reliable, but any inference combining multiple forecasts will be unreliable. And if an expert is both coherent and calibrated but offers no refinement advantage, deference gains the novice nothing. Each failure mode is formally distinct and diagnostically useful.
Takeaway: Rational deference to expert probability judgments requires three independently verifiable conditions: coherence, calibration, and superior refinement. When any one fails, a specific and identifiable form of epistemic authority breaks down.
The formal relationship between calibration and coherence is one of independence: each captures a genuine epistemic virtue, and neither subsumes the other. This independence is not a defect in our formal tools but a reflection of the genuine complexity of probabilistic competence. Rational belief requires both structural consistency and empirical adequacy.
For the epistemology of expertise, this yields a concrete and measurable framework. Expertise is not a single property but a conjunction — coherence, calibration, and domain refinement — each amenable to formal evaluation. This gives us principled tools for a question that philosophy has historically addressed only informally: when does someone else's probability judgment deserve your trust?
The broader lesson is that formal methods do not merely formalize what we already know. The independence of calibration and coherence, the Brier decomposition, the conditions under which calibration is achievable — these are results that restructure the philosophical landscape. They show that the question of rational belief is richer, and more tractable, than informal epistemology suggested.