Historical quantitative analysis confronts a fundamental asymmetry: our questions are vast, but our data are often vanishingly small. We want to know how medieval wages responded to plague mortality, how Roman trade networks shaped provincial growth, or how early modern fertility varied by social class. Yet the surviving records—a handful of manorial accounts, a fragmentary tax roll, a parish register with decades of missing entries—can seem laughably thin relative to the populations and processes they are meant to illuminate.

The temptation in such situations runs in two opposite directions. One is to treat small datasets as though they were large ones, computing means and running regressions without acknowledging the enormous uncertainty that surrounds every estimate. The other is to throw up one's hands entirely, declaring that quantitative analysis is simply impossible when the evidence is sparse. Both responses are methodologically lazy. Between false precision and wholesale abandonment lies a rigorous middle ground—one that demands we take uncertainty seriously, make our assumptions explicit, and distinguish clearly between what our data can and cannot tell us.

This article examines three pillars of that middle ground. First, how we quantify and communicate uncertainty when working with limited historical evidence. Second, how Bayesian methods allow us to combine prior knowledge with sparse data in a principled way. Third, how we recognize the genuine limits of inference—the point at which honesty requires us to say we do not know. Each of these is a methodological skill, but also a habit of intellectual discipline that separates productive quantitative history from numerical storytelling.

Uncertainty Quantification: The Discipline of Honest Estimation

When a researcher reports that the average daily wage in fourteenth-century Oxford was 2.3 pence, the precision of that number conceals a wilderness of uncertainty. How many wage observations underlie the estimate? Were they drawn from a single employer or many? Do they represent peak harvest rates or year-round averages? A point estimate without a confidence interval is, in the context of small historical samples, essentially meaningless. The first principle of small-sample inference is that every estimate must travel with a measure of its own uncertainty.

Standard frequentist tools offer a starting point. For a sample of n wage observations with standard deviation s, the 95% confidence interval around the mean is approximately ±1.96 × s/√n (and with very small samples, the wider critical value of the t-distribution should replace 1.96). When n is 200, that interval is relatively tight. When n is 12, it can be enormous—wide enough to make the point estimate almost uninformative by itself. Researchers accustomed to modern datasets often find this uncomfortable, but the discomfort is the point. The interval is not a flaw in the method; it is an honest description of how little we know.
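
A minimal sketch of that calculation in Python (NumPy and SciPy assumed; the wage figures are invented purely for illustration) shows how wide the interval becomes at n = 12:

```python
import numpy as np
from scipy import stats

def mean_ci(values, level=0.95):
    """Sample mean with a t-based confidence interval (wider than 1.96 for tiny n)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    m = values.mean()
    se = values.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf((1 + level) / 2, df=n - 1)
    return m, m - t_crit * se, m + t_crit * se

# Hypothetical daily-wage observations in pence; illustrative values only.
wages_small = [2.0, 2.5, 1.8, 3.0, 2.2, 2.4, 2.1, 2.6, 1.9, 2.8, 2.3, 2.5]
print(mean_ci(wages_small))   # the interval is wide relative to the mean when n = 12
```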

Beyond simple means, uncertainty propagates through every downstream calculation. If you divide a price index by a wage index to estimate real wages, the uncertainty in both numerator and denominator compounds. Monte Carlo simulation—drawing thousands of plausible values from each distribution and computing the ratio each time—provides a practical way to track this propagation. The resulting distribution of real-wage estimates is often far wider than researchers expect, and far wider than published tables typically convey.
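
A sketch of that propagation, assuming the two indices can be approximated by normal distributions (the means and standard errors below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Hypothetical index estimates with standard errors; all figures illustrative.
wage_index  = rng.normal(loc=110.0, scale=8.0,  size=n_draws)   # nominal wage index draws
price_index = rng.normal(loc=125.0, scale=12.0, size=n_draws)   # price index draws

real_wage = wage_index / price_index * 100   # real-wage index for each simulated draw

lo, mid, hi = np.percentile(real_wage, [2.5, 50, 97.5])
print(f"real-wage index: {mid:.1f} (95% interval {lo:.1f} to {hi:.1f})")
```

The resulting interval is typically much wider than the uncertainty of either index alone, which is the point the simulation makes visible.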

Presentation matters as much as computation. A table that reports point estimates to two decimal places, with confidence intervals buried in a footnote, invites readers to treat those decimals as meaningful. Better practice puts the interval front and center: not "real wages fell 14.7% between 1340 and 1360" but rather "real wages fell by an estimated 5% to 25%, with a central estimate around 15%." This is less satisfying rhetorically, but it is the only honest representation of what twelve manorial accounts can actually tell us.

A related discipline is sensitivity analysis. When your sample is small, individual observations exert outsized influence on results. Removing a single high-wage entry from a dataset of fifteen observations can shift the mean dramatically. Systematically testing how results change when individual data points are excluded—or when coding decisions are varied—reveals whether a conclusion is robust or fragile. If a finding survives only when one particular observation is included, it is not a finding at all. It is an anecdote wearing a statistical costume.
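
A leave-one-out check takes only a few lines; the sample below is invented, with one deliberately high entry to show how far a single observation can move the mean:

```python
import numpy as np

def leave_one_out_means(values):
    """Recompute the mean with each observation dropped in turn."""
    values = np.asarray(values, dtype=float)
    return np.array([np.delete(values, i).mean() for i in range(len(values))])

# Hypothetical sample of fifteen wage observations (pence); illustrative only.
wages = np.array([2.1, 2.3, 2.0, 2.4, 2.2, 2.5, 2.1, 2.3, 2.6, 2.0,
                  2.2, 2.4, 2.1, 2.3, 4.8])   # note one unusually high entry

loo = leave_one_out_means(wages)
print(f"full-sample mean:     {wages.mean():.2f}")
print(f"leave-one-out range:  {loo.min():.2f} to {loo.max():.2f}")
```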

Takeaway

A number without its uncertainty is not a fact—it is a fiction. In small-sample historical work, the confidence interval is not a technicality; it is the most important part of the result.

Bayesian Approaches: Making Prior Knowledge Work

Frequentist statistics treats each dataset as though it arrived in a vacuum, with no prior information about what the answer might be. For many problems in modern science, this is a reasonable discipline. For historical inference with tiny samples, it is often a handicap. We rarely approach a historical question with zero knowledge. We may know that pre-industrial real wages across northwestern Europe generally fell within a certain range, that urban mortality typically exceeded rural mortality, or that grain prices moved within bounds set by transport costs and subsistence floors. Bayesian inference provides a formal framework for combining what we already know with what the new data tell us.

The mechanics are straightforward in principle. You specify a prior distribution—your belief about a parameter before seeing the data, expressed as a probability distribution. You then update that prior using the likelihood of the observed data to produce a posterior distribution—your revised belief after seeing the evidence. When data are abundant, the likelihood dominates and the prior barely matters. When data are sparse, the prior exerts substantial influence. This is not a flaw; it is precisely the correct behavior. With little new evidence, our conclusions should remain close to what we already knew.
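
The simplest concrete case is the conjugate normal model for a mean, sketched below with invented numbers; the prior centre, prior spread, and observations are all assumptions chosen to show the behaviour, not estimates from any real source:

```python
import numpy as np

def normal_update(prior_mean, prior_sd, data, data_sd):
    """Conjugate normal-normal update for a mean, with assumed known data noise."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / data_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * data.mean())
    return post_mean, np.sqrt(post_var)

# Hypothetical prior from comparable regions: mean 2.5 pence, sd 0.5.
obs = [1.5, 1.8, 1.6]                                   # only three new observations
print(normal_update(2.5, 0.5, obs, data_sd=1.0))        # posterior sits between prior and data mean
print(normal_update(2.5, 0.5, obs * 40, data_sd=1.0))   # with abundant data, the likelihood dominates
```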

The critical question is where priors come from, and this is where historical expertise becomes indispensable. A researcher studying wages in a newly discovered sixteenth-century account book might set a prior centered on estimates from comparable regions and periods—say, the Allen real-wage database for early modern Europe. The prior should be weakly informative: broad enough to accommodate surprise, but narrow enough to exclude implausible values. A prior that treats a daily wage of 0 pence and one of 100 pence as equally plausible wastes information we genuinely possess.

One powerful application is the estimation of rates from rare events. Suppose a parish register from a small village records three maternal deaths over twenty years. A frequentist point estimate of the maternal mortality rate has a confidence interval stretching from near-zero to implausibly high values. A Bayesian approach anchored by a prior derived from better-documented parishes of similar size and period will produce a posterior that is both narrower and more credible. The result explicitly reflects what we learned from this village in the context of what we know about villages like it.
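
One way to sketch this is a conjugate Gamma-Poisson model for the yearly death rate; the prior parameters below are invented stand-ins for what estimates from better-documented comparable parishes might justify:

```python
from scipy import stats

# Observed: 3 maternal deaths over 20 years in one small village.
deaths, years = 3, 20

# Hypothetical prior from comparable parishes: roughly 0.15 deaths per year,
# with meaningful uncertainty (Gamma shape and rate chosen for illustration).
prior_shape, prior_rate = 3.0, 20.0   # prior mean = 3/20 = 0.15 per year

# Gamma-Poisson conjugate update.
post_shape = prior_shape + deaths
post_rate  = prior_rate + years

posterior = stats.gamma(a=post_shape, scale=1.0 / post_rate)
lo, hi = posterior.ppf([0.025, 0.975])
print(f"posterior mean rate: {posterior.mean():.3f} deaths/year "
      f"(95% credible interval {lo:.3f} to {hi:.3f})")
```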

Transparency is essential. Every Bayesian analysis should report the prior, justify it, and test sensitivity to alternative priors. If your conclusion changes dramatically when the prior shifts from one reasonable specification to another, the data are simply too thin to discriminate between competing hypotheses. This is valuable information in itself—it tells you exactly where additional evidence would matter most, and it prevents the circular reasoning that arises when strong priors masquerade as strong conclusions.
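
A prior-sensitivity check can reuse the same update with several reasonable priors (all hypothetical here); if the resulting posteriors disagree materially, the data are doing too little of the work:

```python
from scipy import stats

deaths, years = 3, 20

# Several reasonable prior specifications (shape, rate), all hypothetical.
alternative_priors = {
    "comparable parishes":  (3.0, 20.0),
    "broader, weaker":      (1.0, 5.0),
    "slightly higher rate": (4.0, 20.0),
}

for label, (shape, rate) in alternative_priors.items():
    post = stats.gamma(a=shape + deaths, scale=1.0 / (rate + years))
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{label:22s} posterior mean {post.mean():.3f}  95% interval {lo:.3f} to {hi:.3f}")
```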

Takeaway

Bayesian methods do not manufacture certainty from thin air. They make your assumptions visible and let you update them honestly. The prior is not bias—it is accumulated knowledge, declared openly and tested rigorously.

Recognizing Limits: When the Evidence Simply Cannot Answer the Question

No amount of methodological sophistication can extract a signal that does not exist. One of the most important—and most neglected—skills in quantitative history is knowing when to stop. Statistical power analysis, routinely applied in experimental sciences, asks a simple question: given the sample size and variability in your data, what is the minimum effect size you could reliably detect? If you need to distinguish a 10% wage decline from a 20% decline, but your sample can only detect differences larger than 40%, you cannot answer the question you are asking.
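
Power can be estimated by simulation without special machinery: generate data under an assumed true difference, run the test many times, and count how often it reaches significance. The sample size and wage variability below are assumptions chosen for illustration, not estimates from any real source:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power_of_t_test(n_per_group, true_diff, sd, n_sims=5_000, alpha=0.05):
    """Simulated power: how often a two-sample t-test detects a given true difference."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, sd, n_per_group)
        b = rng.normal(true_diff, sd, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# Hypothetical: 12 observations per period, wage standard deviation of 0.5 pence.
for diff in (0.1, 0.25, 0.5, 1.0):
    print(f"true difference {diff:.2f} pence -> power {power_of_t_test(12, diff, 0.5):.2f}")
```

Reading the output backwards answers the power question: the smallest difference for which power is acceptably high is the smallest effect this design can reliably detect.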

This matters acutely because the failure to conduct power analysis leads to a particular kind of error: concluding that no effect exists because the data failed to reach statistical significance. The phrase "absence of evidence is not evidence of absence" is a cliché, but it encodes a real and frequently violated principle. A study that finds "no statistically significant change in mortality after the 1348 plague" based on twelve parish entries has not demonstrated stability—it has demonstrated insufficient data. The honest conclusion is not "mortality did not change" but "our evidence cannot distinguish change from stability."

Conversely, some patterns are so large that even tiny samples can detect them. If every single one of your fifteen wage observations from 1350 exceeds every single one of your twelve observations from 1340, you have strong evidence of an increase even without large samples. The key is matching the ambition of the claim to the resolving power of the data. Small samples can answer coarse questions (did wages roughly double?) but not fine ones (did wages rise by 12% rather than 15%?).
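
The arithmetic behind that intuition is simple: under the null hypothesis that period makes no difference, every ordering of the pooled observations is equally likely, so complete separation is extraordinarily improbable. A sketch with invented wage values:

```python
from math import comb
from scipy import stats

# Hypothetical samples in which every 1350 observation exceeds every 1340 observation.
wages_1340 = [1.70, 1.75, 1.80, 1.85, 1.90, 1.95,
              2.00, 2.05, 2.10, 2.15, 2.20, 2.25]                    # n = 12
wages_1350 = [2.50, 2.55, 2.60, 2.65, 2.70, 2.75, 2.80, 2.85,
              2.90, 2.95, 3.00, 3.05, 3.10, 3.15, 3.20]              # n = 15

# Under exchangeability, exactly one of the C(27, 15) possible assignments of
# rank positions puts all fifteen 1350 values on top.
print(f"probability of complete separation by chance: {1 / comb(27, 15):.2e}")

# A rank-based test makes the same point without any distributional assumption.
print(stats.mannwhitneyu(wages_1350, wages_1340, alternative="greater"))
```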

A practical heuristic is to ask: what is the smallest dataset that could possibly refute my claim? If the answer is a dataset far larger than anything that could plausibly survive from the period in question, the claim is effectively unfalsifiable with historical evidence, and intellectual honesty requires saying so. This does not mean abandoning the question—it means redirecting effort toward gathering additional data, identifying proxy measures, or framing the inquiry in terms the evidence can actually address.

There is a deeper lesson here about the rhetoric of quantitative history. Numbers carry an aura of authority that narrative prose does not. A table of regression coefficients can silence qualitative objections even when the underlying data are desperately thin. The most rigorous quantitative historians are those who use numbers to discipline their claims rather than to inflate them—who treat statistical methods as tools for discovering what they do not know, not as instruments for manufacturing certainty.

Takeaway

The mark of a mature quantitative historian is not the ability to extract conclusions from any dataset, but the willingness to declare when the data cannot support the question being asked.

Small-sample inference in historical research is not a lesser form of quantitative analysis—it is a more demanding one. It requires every assumption to be stated, every uncertainty to be measured, and every conclusion to be calibrated against the resolving power of the evidence. The tools exist: confidence intervals, Monte Carlo propagation, Bayesian updating, power analysis. What is often lacking is the discipline to use them honestly.

The payoff for that discipline is substantial. Conclusions grounded in transparent uncertainty quantification are more credible, more durable, and more useful to subsequent researchers than point estimates presented with false confidence. They also focus attention on where new evidence would matter most—a direct guide to productive archival work.

Quantitative history done well does not pretend to know more than the data allow. It takes the fragmentary record seriously on its own terms, extracts what it can, and is forthright about the rest. That restraint is not a weakness. It is the foundation on which cumulative knowledge is built.