The p Value Controversy

The omnipresent p value has taken quite a beating in the statistical press for decades, and it’s still a thorny issue for many researchers. From time to time we witness donnybrooks carried out in journal letters or pointed articles, with authors defending some statistical practices while attacking others. There are compelling arguments that additional statistical measures, like effect sizes and 95% confidence intervals, help contextualize p values. Another potential solution to the weakness of the p value as a measure of research importance would be to make the cutoff for significance much lower, like 0.01 or even 0.005.
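As a concrete illustration, here is a minimal sketch (in Python, with entirely made-up numbers) of reporting an effect size and a 95% confidence interval alongside the p value, rather than the p value alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical data: two groups with a genuine but small mean difference
control = rng.normal(loc=100.0, scale=15.0, size=500)
treatment = rng.normal(loc=103.0, scale=15.0, size=500)

t_stat, p_value = stats.ttest_ind(treatment, control)

# Cohen's d: standardized effect size using the pooled standard deviation
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

# 95% confidence interval for the difference in means
diff = treatment.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / len(control)
             + treatment.var(ddof=1) / len(treatment))
df = len(control) + len(treatment) - 2
ci = diff + np.array([-1.0, 1.0]) * stats.t.ppf(0.975, df) * se

print(f"p = {p_value:.4f}, d = {cohens_d:.2f}, "
      f"95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Reported together, the three numbers answer different questions: the p value says whether a difference is detectable, the effect size says how big it is, and the interval says how precisely it has been pinned down.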

We also know that the pressure to publish, and the lack of outlets for null findings (the so-called “file drawer” problem), contribute to a biased scientific literature. There is a suspicious distribution of p values in published articles, with an unusual concentration of values falling just below 0.05. This suggests that p values are being manipulated (p-hacking) through selective analysis, or that data “fishing expeditions” are being carried out, with the measurements that happen to be significant being treated as original hypotheses (HARKing, or Hypothesizing After the Results are Known). Perhaps the addition of other statistical metrics would make this kind of poor research practice less attractive, or (what seems more likely) would simply make p-hacking and HARKing a bit more difficult but no less common.
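The inflation that cherry-picking produces is easy to demonstrate. The sketch below, with illustrative parameters, simulates a “fishing expedition”: each study measures ten unrelated outcomes under a true null, and the researcher reports whichever outcome happens to cross 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000
n_outcomes = 10   # ten unrelated measures per study
n_subjects = 30

false_positives = 0
for _ in range(n_experiments):
    # The null is true for every outcome: both groups are drawn
    # from the same population
    a = rng.normal(size=(n_outcomes, n_subjects))
    b = rng.normal(size=(n_outcomes, n_subjects))
    p_values = stats.ttest_ind(a, b, axis=1).pvalue
    # The p-hacker reports only the most favorable outcome
    if p_values.min() < 0.05:
        false_positives += 1

rate = false_positives / n_experiments
print(f"False positive rate with cherry-picking: {rate:.2f}")
```

With ten independent looks, the chance of at least one spurious “significant” result is roughly 1 − 0.95¹⁰ ≈ 0.40, eight times the nominal 5% error rate.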

Leaving aside the problem of research incentivization and irresponsible research practices, we find that underlying most statistical disputes is the question of whether a research outcome matters. The most commonly used statistical tool in our area is the t-test, which is sensitive to sample size: given a large enough sample, it can detect true but trivially small differences between groups. If a change in diet creates a statistically significant reduction in hypercholesterolemia, but the effect size is minimal, does the research on this diet even matter? Would something as radical and life-altering as a permanent alteration in diet be worth a (possible) small drop in LDL? When we graph the distribution of our populations in a case like this, we see that the vast majority of the area under our curves overlaps – only a few dieters would get a reduction in LDL beyond what would be expected in the dispersion of non-dieters.
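The diet scenario can be sketched numerically. With the illustrative (entirely hypothetical) LDL values below, a huge sample makes a tiny true difference statistically detectable even though the two distributions overlap almost entirely:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20000
# Hypothetical LDL values in mg/dL; the true drop is only 1.5 mg/dL
non_dieters = rng.normal(loc=130.0, scale=30.0, size=n)
dieters = rng.normal(loc=128.5, scale=30.0, size=n)

# With n this large, even a 1.5 mg/dL true difference will usually
# reach conventional significance
p = stats.ttest_ind(dieters, non_dieters).pvalue
d = (non_dieters.mean() - dieters.mean()) / 30.0  # known population SD

# Overlapping coefficient for two equal-variance normal curves,
# computed from the true standardized difference: 2 * Phi(-|d| / 2)
overlap = 2 * stats.norm.cdf(-abs(1.5 / 30.0) / 2)

print(f"p = {p:.4g}, Cohen's d = {d:.3f}, "
      f"area shared by the two curves = {overlap:.1%}")
```

The p value can be vanishingly small while roughly 98% of the area under the two curves is shared – exactly the picture described above.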

Compounding matters is poorly nuanced reporting by popular press outlets. A measured, careful article, complete with caveats about the need for further study and the limits of the clinical applicability of its findings, can be latched onto by a newspaper or news program that breathlessly reports its tenuous finding as breakthrough science. When, inevitably, the needle swings the other way, with other studies finding an opposite effect or failing to reproduce the findings, the popular press reports this, too, with the same lack of statistical rigor. The public grows fatigued and begins to view science, and even facts, as mutable and chancy.

All this points to the importance of researcher responsibility in carrying out and reporting research. By all means, if you have a finding with a significant p value but a small effect size or a broad confidence interval, report it! Your finding, which boasts statistical significance but lacks other convincing evidence of its actual usefulness, may be a hint that points in the direction of something more substantial. If you have a serendipitous finding – a statistically significant difference in a measure you didn’t hypothesize about – great! Be honest that the finding wasn’t predicted by your hypothesis and that the difference merits more focused study. Be aware of the limits of your findings, and understand what your statistical tests are really telling you. Go beyond the math – visualize your findings, imagine their application to clinical care, try explaining the effect size in ordinary, non-scientific terms. Being transparent about what you expected, what you found, and why it matters is the basis of good science.
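One way to put an effect size in ordinary terms is the common-language effect size: assuming two equal-variance normal populations, the probability that a randomly chosen treated person does better than a randomly chosen control works out to Φ(d/√2). A short sketch:

```python
from scipy import stats

def common_language_effect_size(d: float) -> float:
    """Probability that a random draw from the treated group exceeds
    a random draw from the control group, assuming two equal-variance
    normal populations separated by Cohen's d."""
    return float(stats.norm.cdf(d / 2 ** 0.5))

# Conventional "small", "medium", and "large" effects, plus a tiny one
for d in (0.05, 0.2, 0.5, 0.8):
    print(f"d = {d:.2f} -> P(treated beats control) = "
          f"{common_language_effect_size(d):.1%}")
```

A d of 0.2 translates to roughly a 56% chance that a random treated person outdoes a random control – barely better than a coin flip, which is far easier for a non-statistician to weigh than a bare p value.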