Imagine you're planning to build a bridge across a river. Before committing millions to construction, you send a small team to test the soil, check the weather patterns, and see if your equipment can handle the terrain. That scouting mission doesn't tell you whether the bridge will hold traffic—it tells you whether building it is feasible.

Pilot studies serve exactly this role in research. They're scouting missions—small, preliminary investigations designed to test whether a larger study can realistically be carried out. Yet across biomedical research, psychology, and education, pilot data gets routinely misused. Researchers treat tiny preliminary samples as if they've already answered the scientific question.

The consequences are significant. Inflated effect sizes from underpowered pilots feed into flawed sample size calculations, producing main studies that are themselves too small. Promising-looking preliminary results get published as standalone evidence, entering the literature as if they carried a rigor they were never designed to provide. Understanding what pilot studies can and cannot do is one of the most practical statistical skills a researcher—or a reader of research—can develop.

Feasibility vs. Effect Size: Two Very Different Questions

The fundamental confusion surrounding pilot studies comes from conflating two distinct questions. The first: Can we actually run this study? The second: Does this treatment work? A pilot study is designed to answer the first question. It tests logistics—recruitment rates, protocol adherence, measurement reliability, dropout patterns. It is not designed, and not statistically equipped, to answer the second.

Yet the temptation is enormous. You've run a small study with 20 participants, and the treatment group improved more than the control group. The p-value is 0.04. It's natural to feel you've already found something. But that p-value emerged from a study that was never powered to detect a meaningful effect. With only 20 participants, random variation dominates. The apparent effect could easily be noise masquerading as signal.
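
The claim that "random variation dominates" is easy to state and hard to feel, so here is a minimal simulation sketch. The group size, threshold, and use of numpy and scipy are illustrative assumptions, not part of the original example; the point is only to show how often a pilot of this size produces an impressive-looking standardized difference when the true effect is exactly zero.

```python
# Minimal simulation sketch (hypothetical numbers): how often does a pilot with
# 10 participants per group and NO true treatment effect still show a
# "promising" standardized difference?
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_sims = 10, 10_000
large_looking = 0

for _ in range(n_sims):
    treat = rng.normal(0.0, 1.0, n_per_group)   # true effect is zero
    ctrl = rng.normal(0.0, 1.0, n_per_group)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    d = (treat.mean() - ctrl.mean()) / pooled_sd  # observed Cohen's d
    if abs(d) > 0.5:
        large_looking += 1

print(f"Pilots showing |d| > 0.5 despite a null effect: {large_looking / n_sims:.0%}")
# Typically around a quarter of runs -- noise alone routinely produces
# "impressive" effects at this sample size.
```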

The distinction matters because these two questions require fundamentally different designs. Feasibility assessment needs qualitative and process-oriented data: How long did recruitment take? Did participants understand the instructions? Were there unexpected practical barriers? These answers don't require large samples. But estimating whether an intervention works—and by how much—demands statistical power, which demands adequate sample size. A pilot, by definition, lacks this.

When researchers blur this boundary, the pilot becomes a Trojan horse. It enters the literature looking like evidence of effectiveness when it was only ever evidence that the study machinery functions. Reviewers and readers who don't recognize this distinction can be misled into treating preliminary feasibility data as proof of concept—a category error with real consequences for how resources and attention get allocated in science.

Takeaway

A pilot study answers 'Can we do this study?' not 'Does this work?' Treating feasibility data as efficacy evidence is one of the most common and consequential errors in research design.

The Unreliability of Effect Size Estimates from Small Samples

Here's where the statistics get genuinely treacherous. Many researchers conduct a pilot study, observe an effect size—say, a Cohen's d of 0.8—and then use that estimate to calculate how many participants the main study needs. This sounds reasonable. It's also deeply flawed. Effect size estimates from small samples are extraordinarily imprecise, with confidence intervals wide enough to be practically meaningless.

Consider a pilot with 15 participants per group that yields an observed effect size of d = 0.5. The 95% confidence interval around that estimate stretches roughly from -0.2 to 1.2. The true effect could be nonexistent, modest, or large. Using that point estimate of 0.5 to power your main study is like using a single day's temperature to predict the annual climate. You might get lucky. You probably won't.
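
To see where that interval comes from, here is a short worked sketch using the common large-sample approximation for the variance of Cohen's d. The numbers mirror the example above; the formula is an approximation rather than an exact interval.

```python
# Rough 95% CI for an observed Cohen's d of 0.5 with 15 participants per group,
# using the standard large-sample variance approximation for d.
import math

d, n1, n2 = 0.5, 15, 15
var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # approximate variance of d
se_d = math.sqrt(var_d)

lower, upper = d - 1.96 * se_d, d + 1.96 * se_d
print(f"Observed d = {d}, 95% CI roughly ({lower:.2f}, {upper:.2f})")
# -> approximately (-0.23, 1.23): consistent with no effect, a modest effect,
#    or a large one.
```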

Worse still, small samples systematically inflate observed effect sizes—a phenomenon sometimes called the winner's curse. If your pilot only reaches statistical significance when random variation happens to push the estimate upward, the "successful" pilots that proceed to main studies carry inflated expectations baked in. The main study, powered to detect an exaggerated effect, ends up underpowered for the true, smaller effect. The result: a well-intentioned but statistically doomed replication attempt.
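
A small simulation makes the winner's curse concrete. The sketch below assumes a hypothetical true effect of d = 0.3, 15 participants per group, and numpy and scipy; it keeps only the pilots that happen to cross p < 0.05 and asks what effect size those "lucky" pilots report.

```python
# Simulation sketch of the winner's curse (hypothetical numbers): among pilots
# that happen to reach p < 0.05, the average observed effect overshoots the truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_per_group, n_sims = 0.3, 15, 20_000
significant_ds = []

for _ in range(n_sims):
    treat = rng.normal(true_d, 1.0, n_per_group)
    ctrl = rng.normal(0.0, 1.0, n_per_group)
    t, p = stats.ttest_ind(treat, ctrl)
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    d = (treat.mean() - ctrl.mean()) / pooled_sd
    if p < 0.05 and d > 0:                       # keep only the "successful" pilots
        significant_ds.append(d)

print(f"True effect: d = {true_d}")
print(f"Mean observed d among 'significant' pilots: {np.mean(significant_ds):.2f}")
# The significant pilots report roughly d = 0.8-0.9: the ones that look
# promising are precisely the ones whose estimates were pushed upward by chance.
```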

Simulation studies have demonstrated this clearly. Researchers who power main studies based on pilot effect sizes consistently produce underpowered research. Better approaches exist—using effect sizes from meta-analyses, specifying the smallest clinically meaningful difference, or using safeguard estimates that correct for small-sample bias. The pilot's job isn't to estimate the size of the effect. It's to ensure you can measure it properly when the real study begins.
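
As a rough illustration of the alternative, the sketch below uses statsmodels' power routines to compare sizing a main study on the pilot's point estimate versus on a smallest-meaningful-difference target. The specific d values (0.8 from the pilot, 0.3 as the smallest effect worth detecting) are assumptions chosen for the example.

```python
# Sketch: size a main study on the smallest difference worth detecting rather
# than on the pilot's (likely inflated) point estimate.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

n_from_pilot = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
n_from_mcid = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)

print(f"Per-group n if you trust the pilot's d = 0.8: {n_from_pilot:.0f}")    # ~26
print(f"Per-group n for the smallest meaningful d = 0.3: {n_from_mcid:.0f}")  # ~175
# If the true effect is nearer 0.3, the pilot-based study is hopelessly underpowered.
```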

Takeaway

Effect size estimates from small pilots are too imprecise to reliably power main studies. The confidence intervals are vast, and the winner's curse means the estimates that look promising are the ones most likely to be inflated.

What Pilot Studies Actually Do Well

Strip away the misuses and pilot studies are genuinely valuable—arguably indispensable—for the things they're actually designed to do. They answer the practical questions that no amount of theoretical planning can resolve. Will participants tolerate a 90-minute assessment battery, or will they drop out after 60 minutes? Can research assistants deliver the intervention consistently? Does the randomization procedure work in practice? These are questions about process, not outcomes.

Pilot studies also reveal measurement problems that would otherwise sabotage a full trial. A questionnaire might produce ceiling effects in your specific population. A biomarker assay might have unexpected variability with your sample storage conditions. An outcome measure might be culturally inappropriate for your target demographic. Discovering these issues in a pilot costing $20,000 is vastly preferable to discovering them in a main study costing $2 million.

They're also essential for estimating recruitment and retention parameters—not effect sizes, but the logistical numbers you need for planning. If you can recruit 5 participants per month from a given clinic, your 200-person trial will take over three years. That's critical information for grant applications, staffing decisions, and timeline commitments. A pilot gives you those numbers with reasonable precision even at small scale.
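
The arithmetic behind that timeline is simple enough to sketch; the numbers below are taken from the example above.

```python
# Back-of-the-envelope recruitment timeline using the figures from the text:
# 5 participants per month toward a 200-person target.
target_n = 200
recruits_per_month = 5          # rate observed in the pilot

months_needed = target_n / recruits_per_month
print(f"Estimated recruitment period: {months_needed:.0f} months "
      f"(~{months_needed / 12:.1f} years)")
# -> 40 months, roughly 3.3 years: the kind of number that reshapes staffing
#    plans and grant timelines before the main study begins.
```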

The key is reporting pilot studies honestly. The CONSORT extension for pilot trials explicitly recommends against hypothesis testing in pilots and encourages researchers to focus on feasibility objectives with clear success criteria defined in advance. A well-conducted pilot doesn't conclude "the treatment works." It concludes "a definitive trial is feasible, with the following modifications to the original protocol." That's not a lesser contribution—it's a different and essential one.

Takeaway

The real value of a pilot study lies in stress-testing your research machinery—protocols, measures, recruitment, and logistics. These process-level insights are what make the difference between a main study that runs smoothly and one that collapses under avoidable problems.

Pilot studies occupy a critical but frequently misunderstood position in the research pipeline. They are dress rehearsals, not opening nights. Their value is enormous when used to refine protocols, identify measurement issues, and estimate practical parameters like recruitment rates and dropout patterns.

The danger comes from asking them to do what they cannot—provide reliable estimates of treatment effects or definitive evidence of efficacy. Small samples produce noisy estimates, and noisy estimates make poor foundations for major studies.

Next time you encounter a pilot study—whether you're designing one, reviewing one, or reading about one—ask a simple question: Is this answering a feasibility question or an efficacy question? If the answer is efficacy, the sample almost certainly isn't large enough to trust the conclusion.