Science doesn't need to be so complicated. The answer: more sensible statistics

Let statisticians supervise the battle between human psychology and science

Tim / Flickr

I’m not much of a magician, but I do have one trick up my sleeve. I spread a deck of cards in front of a friend and say: “Pick a card, any card.” Immediately after he looks at the card, I say: “Three of clubs.”

Usually, it’s not the three of clubs. The trick flops, and we have a good laugh. But about once in every 52 attempts, by sheer luck, I guess correctly. The most important part of this trick is never to perform it more than once on any single person. Even if you’re lucky enough to guess correctly on the first try, a failed replication will break the spell.
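(For the curious, here is a quick back-of-the-envelope simulation of those odds – a minimal Python sketch in which the 1-in-52 guess and the repeat attempt are just the setup of the trick described above.)

```python
import random

TRIALS = 100_000
lucky_once = 0    # guessed right on the first try
lucky_twice = 0   # guessed right on the first try AND on the attempted "replication"

for _ in range(TRIALS):
    first = random.randint(1, 52) == 1    # a 1-in-52 lucky guess
    second = random.randint(1, 52) == 1   # the attempted repeat
    lucky_once += first
    lucky_twice += first and second

print(f"lucky on the first try: {lucky_once / TRIALS:.4f}  (expected ~{1/52:.4f})")
print(f"lucky twice in a row:   {lucky_twice / TRIALS:.5f}  (expected ~{1/52**2:.5f})")
```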

Curse of the p-value

We assume that scientific findings, unlike my card trick, aren’t the result of chance alone. As such, one basic test to pass scientific muster is that findings can be replicated by repeated experiments. However, for the better part of a decade, a fear has been brewing that it all might be a house of cards – that many, or even most, published scientific findings are false. Fear of a "replication crisis" in science – so named because of demonstrations that many scientific findings cannot be replicated – isn’t just the opinion of a few climate change-denying flat-earthers. Rather, it is arguably the majority opinion among scientists themselves. In a study conducted by Nature in 2016, 52 percent of surveyed researchers believed that there was a “significant crisis” at hand, with only 3 percent flatly denying any crisis at all.

Magic and/or statistics

Farhan Siddicq / Unsplash

Though the culprits of the replication crisis are many, mishandling and misinterpreting statistical evidence has long been recognized as one of the kingpins. So to most statisticians, a solution was obvious: better statistics education. But at the heart of good statistics is a deep suspicion of any "obvious" truths: all hypotheses must be tested. Blakeley McShane of Northwestern University and David Gal at the University of Illinois at Chicago decided to put this obvious solution to the test. If a deeper understanding of statistics can indeed guard a scientist from making errors of judgement, they reasoned, then surely statisticians should make fewer such errors.

What they found was disheartening. They gave two groups a set of questions designed to gauge the degree to which they could be misled by the p-value, a statistical quantity both very common in scientific research and very prone to misinterpretation. Though one group consisted of professional statisticians and the other did not, both performed poorly. Adding insult to injury, all of the statisticians included in the study were authors of articles published in the prestigious Journal of the American Statistical Association, the very journal in which McShane and Gal would publish their findings.

More than a mild point of embarrassment for the statistical community, this finding casts serious doubt on any solution to the replication crisis predicated on better statistical teaching. If highly educated statisticians fall prey to the same statistical fallacies as their non-statistician colleagues, then perhaps the source of statistical error lies beyond what education can correct. Rather than statistical naiveté, the root cause seems to lie deeper, in human psychology.

'Is your finding real?'

The genesis of the replication crisis can be traced back to 1925, when an off-hand suggestion made by statistics giant Ronald Fisher arguably altered the entire practice of science. The suggestion was nothing more than a “rule of thumb” regarding the use of one of his many statistical inventions: the p-value. A p-value is a number between 0 and 1 that you can compute from your data. Roughly, it measures how surprising your data would be if chance alone were at work, giving some limited insight – though far from a complete picture – into the strength of your finding. Fisher’s rule of thumb said: if the p-value is under 0.05, the finding is “statistically significant.”
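For a rough feel for how the number behaves, here is a hypothetical example (the coin, the 100 flips, and the 60 heads are invented for illustration): suppose you flip a coin 100 times to check whether it is fair and see 60 heads. The p-value asks how often chance alone would produce a result at least that lopsided.

```python
from scipy.stats import binom

n, heads = 100, 60   # hypothetical: 100 flips of a possibly unfair coin, 60 of them heads

# Two-sided p-value under the null hypothesis of a fair coin: the probability
# of an outcome at least as lopsided as 60 heads (or, symmetrically, 60 tails).
p_value = binom.sf(heads - 1, n, 0.5) + binom.cdf(n - heads, n, 0.5)

print(f"p = {p_value:.3f}")   # ~0.057, just over Fisher's line
print("statistically significant" if p_value < 0.05 else "not statistically significant")
```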

Fisher and his 'statistically significant' beard

Public domain

Fisher never intended this to be taken too seriously, but his rule of thumb soon became gospel. It provided a simple procedure that could seemingly be applied to any study and allowed scientists to blur away the hairiness of experimentation and give a simple “yes or no” answer to the impossibly complicated question, “Is my finding real?” Perhaps most vital to its appeal, the term “statistically significant” has a seductive veneer of objectivity and rigor. All this made Fisher’s humble suggestion spread like wildfire through nearly every scientific discipline. As professor Jeffrey Leek jokes: “If [Fisher] was cited every time a p-value was reported his paper would have, at the very least, 3 million citations – making it the most highly cited paper of all time.”

This false sense of certainty with which researchers can categorize data as "good" or "bad," "statistically significant" or "not statistically significant," solely based on the p-value is referred to as the problem of "dichotomization of evidence." Such a procedure is akin to making a college admissions decision based only on the reading portion of the SAT. Sure, this number tells you something about the applicant's aptitude, and it's certainly better than nothing. But given that you have a plethora of other information at hand – her GPA, her extracurricular activities, even her scores on the other sections of the SAT – it would seem ridiculous to make a complex decision from a single number that obviously gives only a partial picture.
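To see how blunt that dichotomy is, imagine two studies whose evidence is essentially identical – one reports p = 0.049, the other p = 0.051 (numbers invented purely for illustration). Fisher's rule, applied mechanically, sorts them into opposite bins:

```python
def fishers_rule(p_value: float) -> str:
    """Fisher's rule of thumb, applied mechanically."""
    return "statistically significant" if p_value < 0.05 else "not statistically significant"

# Two hypothetical studies with practically indistinguishable evidence:
for p in (0.049, 0.051):
    print(f"p = {p:.3f}  ->  {fishers_rule(p)}")
```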

The gorilla suit test

The survey that McShane and Gal administered to their subjects appeared to be an unassuming statistics quiz. In fact, it was closer in spirit to the classic “gorilla suit” psychology experiment (if you haven’t seen it, it’s worth the two-minute detour). In the experiment, a short video is played in which a few people pass a basketball between themselves, and the subject is instructed to count the number of basketball passes that occur. When the video is finished, the subject is asked how many passes she observed and whether or not she noticed anything unusual. A typical subject, in her effort to carefully count basketball passes, will completely miss the fact that a person in a gorilla suit walked into the middle of the screen, danced around, and left.

In similar fashion, McShane and Gal blinded their subjects to the obvious by distracting them with gratuitous p-values, some above the infamous 0.05 threshold and some below. The authors found that statisticians and non-statistician scientists alike performed quite differently depending on whether or not the p-value was “significant” according to Fisher’s rule of thumb. (I read the questions and was duped like everyone else.)

What this implies, at least according to the authors, is that Fisher’s rule is embedded so deeply into the psyche of all scientists – statisticians or not – that no amount of statistical training can unearth it.

The fact that a typical human misses the dancing gorilla won’t be changed by a better education on gorillas or basketball. We accept this psychological blind spot as an unavoidable fact of human nature and try to arrange our world in such a way that the problem is minimized. Knowing that humans are terrible at multitasking, we avoid it in many scenarios: texting while driving, for instance, is largely illegal.

Likewise, psychologists have known for a while that human intuition is garbage at statistics. Trying to educate away this innate shortcoming may be a fool's errand. Rather, scientists should be trying to minimize the number of statistical calculations people have to make to understand their work. 

This is not at all to say that science should divorce itself from statistics. Instead, we need better statistics – statistics designed to guard against our faulty intuition rather than preying on it.

Peer Commentary

Feedback and follow-up from other members of our community

P-values, for now, seem to be a necessary evil. Over-focusing on them obscures other important statistical information, like effect sizes (how much of a difference did the treatment make?) and confidence intervals (if the treatment does make a difference, what’s the range of values we would expect to see in the variable we measured?). But right now it’s still difficult to publish papers if you don’t include p-values.

I tried exactly that in a recent paper and it got sent back – for many reasons, mind you! – but one was that the reviewers wanted to see p-values instead of just confidence intervals. This matters because even if a result does not have a significant p-value, it can still be meaningful. For example, if I find that trees planted on clay soil grow twice as fast as the same trees planted in sand, that’s useful information for someone trying to replant a forest, even if the p-value for my study is 0.1. It’s up to the person planting the trees to decide. Right now that hypothetical research could be filtered out by a journal reviewer who only sees a non-significant result.
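To sketch that scenario in code (the growth numbers below are invented for illustration, not the commenter's data): a handful of trees on clay growing roughly twice as fast as trees on sand can easily yield a p-value above 0.05, even though the estimated effect and its confidence interval are exactly what a forester would want to know.

```python
import numpy as np
from scipy import stats

# Hypothetical growth rates (cm per year) for the same species on two soils.
clay = np.array([25, 80, 30, 70, 45])   # mean ~50
sand = np.array([ 5, 50, 10, 45, 15])   # mean ~25

t_stat, p_value = stats.ttest_ind(clay, sand)   # pooled two-sample t-test
diff = clay.mean() - sand.mean()                # effect size: difference in means

# 95% confidence interval for the difference in means (pooled standard error).
n1, n2 = len(clay), len(sand)
pooled_var = ((n1 - 1) * clay.var(ddof=1) + (n2 - 1) * sand.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
half_width = stats.t.ppf(0.975, df=n1 + n2 - 2) * se

print(f"clay grows {clay.mean() / sand.mean():.1f}x faster (difference {diff:.0f} cm/yr)")
print(f"95% CI for the difference: {diff - half_width:.0f} to {diff + half_width:.0f} cm/yr")
print(f"p = {p_value:.2f}  (not 'significant' by Fisher's rule, but far from worthless)")
```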

Oh boy, the dreaded p-value. As someone who’s spent a lot of time in math and statistics classes, I guess what I have noticed is a blurring of what the p-value “means” to humans, and how that confusion can lead to false claims.

The deck-of-cards example here can be misleading in a way. While the probability of guessing correctly is one in 52, computing a p-value for the same experiment across many friends would require a hypothesis about the expected outcome of the test – reasonably, that you would not guess correctly more often than chance allows. So the p-value gives you insight into how your data compare with that expectation, but not the true numerical values that represent reality.
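Concretely (with made-up numbers): say the trick is attempted once on each of 52 friends and, by luck, three guesses land. Under the null hypothesis that every guess is pure chance, the p-value is the probability of doing at least that well by accident.

```python
from scipy.stats import binom

n_friends = 52      # hypothetical: one attempt per friend
hits = 3            # hypothetical: three lucky guesses
p_chance = 1 / 52   # null hypothesis: every guess is random

# One-sided p-value: probability of at least `hits` correct guesses by chance alone.
p_value = binom.sf(hits - 1, n_friends, p_chance)
print(f"p = {p_value:.3f}")   # ~0.08: not "significant", yet no evidence of real magic either
```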

The hard part is linking reality and math. The psychology, intuition, and math of this are quite hard to separate in an academic setting. And in a world where many scientific data sets are concerned with mean values, I think that rather than avoiding the p-value, we should be required to show confidence intervals, p-values, effect sizes, and the raw distribution of the data to make that connection from math to “is this what we see in reality” easier, rather than rely on a single computed value to judge the validity of all the studies in the universe.
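A minimal sketch of what that fuller report might look like for a generic two-group comparison – the helper function and the example data are invented for illustration, not the commenter's; it simply packages the same quantities as the tree sketch above into one reusable summary:

```python
import numpy as np
from scipy import stats

def report(group_a, group_b):
    """Print effect size, 95% CI, p-value, and the raw distributions for two samples."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    diff = a.mean() - b.mean()                    # effect size: difference in means
    _, p = stats.ttest_ind(a, b)                  # pooled two-sample t-test
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    half = stats.t.ppf(0.975, df=n1 + n2 - 2) * np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    print(f"effect size (mean difference): {diff:.2f}")
    print(f"95% CI: ({diff - half:.2f}, {diff + half:.2f})")
    print(f"p-value: {p:.3f}")
    print(f"raw data, group A: {np.sort(a)}")     # show the distributions themselves
    print(f"raw data, group B: {np.sort(b)}")

# Example with made-up measurements:
rng = np.random.default_rng(1)
report(rng.normal(10, 2, size=15), rng.normal(9, 2, size=15))
```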