Every year, the number of films Nicolas Cage appears in correlates strikingly with the number of people who drown in swimming pools. The relationship is statistically real, and completely meaningless. Yet headlines routinely treat correlations as if they prove cause and effect, leading readers to conclusions the data simply cannot support.
Correlation is one of the most useful tools in scientific research, and one of the most misunderstood. It tells us that two things vary together, but it stays conspicuously silent on why. Understanding this distinction isn't just academic pedantry—it's the difference between identifying genuine health risks and chasing statistical phantoms.
The good news: scientists have developed rigorous methods for building causal arguments from correlational data. The key is knowing exactly what correlation measures, recognizing the traps that create false associations, and understanding the additional evidence needed before claiming one thing actually causes another.
Correlation Mechanics
A correlation coefficient—typically denoted as r—quantifies how two variables move together. Values range from -1 to +1. A coefficient of +1 means perfect positive correlation: as one variable increases, the other increases proportionally. A coefficient of -1 indicates perfect negative correlation: as one rises, the other falls. Zero means no linear relationship exists.
The critical word here is linear. Correlation coefficients specifically measure straight-line relationships. Two variables could have a strong curved relationship—think of anxiety and performance, which often follows an inverted U-shape—and still show a correlation near zero. The math captures only one type of pattern.
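The linear-only nature of r is easy to demonstrate. The sketch below uses a hand-rolled `pearson_r` helper and invented arousal/performance numbers that follow a perfect inverted U; despite the exact curved dependence, the correlation comes out to zero.

```python
# Pearson's r captures only straight-line association. A perfect but
# symmetric curved relationship, like the inverted U linking arousal
# and performance, can yield a correlation of exactly zero.
# (Illustrative numbers, not real performance data.)

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

arousal = [-3, -2, -1, 0, 1, 2, 3]
performance = [-a * a for a in arousal]  # perfect inverted-U dependence

print(pearson_r(arousal, performance))   # 0.0: strong relationship, zero r
```

The symmetry is what defeats r: positive and negative deviations cancel exactly, even though knowing arousal determines performance perfectly.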
Strong correlations can emerge from pure coincidence, especially in datasets with many variables. If you measure enough things simultaneously, some will correlate by chance alone. This is why researchers use statistical significance testing—to estimate the probability that an observed correlation could arise from random variation. But even statistically significant correlations prove nothing about causation.
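This multiple-comparisons trap is easy to simulate. In the sketch below (the seed, sample size, and 0.5 threshold are arbitrary choices), fifty streams of pure random noise are compared pairwise, and some pairs still clear a strong correlation threshold with no real relationship anywhere.

```python
# With many variables measured at once, some pairs correlate strongly
# by chance alone. Fifty streams of pure noise (20 observations each)
# are compared pairwise; several pairs exceed |r| > 0.5 despite there
# being no real relationship to find.
import random
from itertools import combinations

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(42)
noise = [[random.gauss(0, 1) for _ in range(20)] for _ in range(50)]

spurious = [(i, j) for i, j in combinations(range(50), 2)
            if abs(pearson_r(noise[i], noise[j])) > 0.5]
print(f"{len(spurious)} of {50 * 49 // 2} pairs exceed |r| = 0.5")
```

With 1,225 pairs and only 20 observations each, dozens of "strong" correlations typically appear by luck alone, which is exactly why significance thresholds must account for how many comparisons were made.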
Consider the robust correlation between ice cream sales and drowning deaths. Both rise in summer, creating a genuine statistical relationship. But eating ice cream doesn't cause drowning, and drowning doesn't drive ice cream purchases. The correlation is real—it will replicate in new data—but the implied causal link is an illusion created by a shared underlying factor.
Takeaway: Correlation coefficients measure only linear relationships between variables. A strong, statistically significant correlation tells you the pattern is unlikely to be random noise, but it cannot tell you whether one variable influences the other.
Third Variable Problem
The ice cream and drowning example illustrates the third variable problem—also called confounding. Both variables correlate because they share a common cause: warm weather. Summer heat drives people to buy ice cream and to swim, creating a statistical shadow of a relationship that doesn't exist directly between the two.
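The shared-cause structure behind this illusion can be reproduced in a few lines. The toy model below (all coefficients and noise levels are invented for illustration) generates ice cream sales and pool visits from temperature alone, with no direct link between them, yet the two still correlate strongly.

```python
# A shared cause (daily temperature) drives two otherwise unrelated
# quantities. Each depends only on temperature plus independent noise,
# yet the two correlate strongly with each other.
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
temp = [random.gauss(25, 5) for _ in range(500)]            # daily highs
ice_cream = [2.0 * t + random.gauss(0, 3) for t in temp]    # sales
pool_visits = [0.5 * t + random.gauss(0, 2) for t in temp]  # swimmers

r = pearson_r(ice_cream, pool_visits)
print(f"r = {r:.2f}")  # strongly positive despite no direct link
```

Nothing in the generative model connects ice cream to pool visits; temperature alone manufactures the association, exactly as warm weather does in the real data.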
Confounders can be subtle and counterintuitive. Studies once found that moderate alcohol consumption correlated with better health outcomes than complete abstinence. This sparked debates about alcohol's potential benefits. Later research revealed a confound: the 'abstainer' group included former heavy drinkers who quit for health reasons and people too sick to drink. Once researchers controlled for these factors, alcohol's apparent protective effect largely vanished.
Observational studies—where researchers measure variables without experimental manipulation—are particularly vulnerable to confounding. Unlike randomized experiments, which distribute confounders equally across groups, observational research inherits all the messy correlations present in natural populations. Researchers use statistical techniques like regression adjustment and matching to control for known confounders.
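One common form of regression adjustment is to regress each variable on the measured confounder and correlate the residuals, a partial correlation. The sketch below reuses the same invented temperature model: once temperature is removed, the apparent ice-cream/pool-visit association collapses toward zero. This is a minimal illustration, not a full regression toolkit.

```python
# Adjusting for a measured confounder by residualization: regress each
# variable on temperature with simple least squares, then correlate
# the residuals. The raw correlation is strong; the adjusted one is not.
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def residualize(y, z):
    """Residuals of y after simple least-squares regression on z."""
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n
    beta = (sum((a - mz) * (b - my) for a, b in zip(z, y))
            / sum((a - mz) ** 2 for a in z))
    return [b - (my + beta * (a - mz)) for a, b in zip(z, y)]

random.seed(1)
temp = [random.gauss(25, 5) for _ in range(500)]
ice_cream = [2.0 * t + random.gauss(0, 3) for t in temp]
pool_visits = [0.5 * t + random.gauss(0, 2) for t in temp]

raw = pearson_r(ice_cream, pool_visits)
adjusted = pearson_r(residualize(ice_cream, temp),
                     residualize(pool_visits, temp))
print(f"raw r = {raw:.2f}, adjusted for temperature r = {adjusted:.2f}")
```

The adjustment works here only because temperature was measured and included; an unmeasured confounder would leave the spurious correlation fully intact.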
The deeper problem is unknown confounders—variables researchers haven't measured or haven't conceived of. No amount of statistical adjustment can control for what you haven't identified. This is why a single observational study, no matter how large or well-designed, cannot definitively establish causation. It can suggest, it can strengthen suspicion, but it cannot prove.
Takeaway: Whenever you encounter a correlation, ask: what third variable could cause both? Confounders create statistical relationships between variables that have no direct influence on each other, and unknown confounders lurk in every observational study.
From Correlation to Causation
Scientists don't abandon correlational evidence—they build causal arguments by accumulating multiple independent lines of support. The epidemiologist Austin Bradford Hill formalized this approach in 1965, proposing criteria for evaluating causal claims from non-experimental data. These principles remain foundational today.
Temporal precedence is the first requirement: the cause must precede the effect. Longitudinal studies that track people over time can establish this sequence. If high blood pressure measured in 2010 correlates with heart attacks in 2020, the temporal order supports a causal interpretation—though it still doesn't prove it.
Dose-response relationships strengthen causal arguments. If more exposure leads to more effect—more cigarettes correlate with higher lung cancer rates—the pattern suggests a causal mechanism at work. Similarly, consistency across different populations, time periods, and research methods adds credibility. When the same correlation appears in Japanese women, Swedish men, and Brazilian teenagers, coincidental confounding becomes less plausible.
Mechanistic plausibility provides perhaps the strongest support. Understanding how a cause produces an effect—the biological pathway by which tobacco smoke damages lung tissue, for instance—transforms correlation from suspicious pattern to credible hypothesis. When statistical association aligns with known mechanisms, the causal case grows compelling. Scientists rarely claim causation from any single study; they build it from converging evidence across multiple approaches.
Takeaway: Causal arguments from correlational data require multiple supporting threads: the cause must precede the effect, larger exposures should produce larger effects, the pattern should replicate across contexts, and a plausible mechanism should explain how the cause produces the effect.
Correlation is a starting point, not a destination. It identifies patterns worth investigating and rules out relationships that don't exist statistically. But the journey from 'these things vary together' to 'one causes the other' requires careful additional work.
When you encounter correlational claims—in research papers, news articles, or casual conversation—apply the critical questions. Could a third variable explain this? Does the timeline fit? Is there a plausible mechanism? Is this pattern replicated elsewhere?
These questions don't require advanced statistical training. They require only the recognition that our pattern-seeking minds easily mistake coincidence for causation, and that science has developed tools to tell the difference.