http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/
—-
The project, spearheaded by the Open Science Collaboration, aimed to replicate 100 studies published in three high-profile psychology journals during 2008. The idea arose amid a growing concern that psychology has a false-positive problem: In recent years, important findings in the field have been called into question when follow-up studies failed to replicate them, hinting that the original studies may have mistaken spurious effects for real ones.
“The idea was to see whether there was a reproducibility problem, and if so, to stimulate efforts to address it,” project leader Brian Nosek told me. In total, 270 co-authors and 86 volunteers contributed to the effort.
This wasn’t a game of “gotcha.” “Failing to replicate does not mean that the original study was wrong or even flawed,” Nosek told me, and the objective here wasn’t to overturn anyone’s results or call out particular studies. The project was designed to conduct fair and direct replications, Nosek said. “Before we began, we tried to define a protocol to follow so that we could be confident that every replication we did had a fair chance of success.” Before embarking on their studies, replicators contacted the original authors and asked them to share their study designs and materials. Almost all complied.
Researchers who conducted the replication studies also asked the original authors to scrutinize the replication plan and provide feedback, and they registered their protocols in advance, publicly sharing their study designs and analysis strategies. “Most of the original authors were open and receptive,” project coordinator Mallory Kidwell told me.
Despite this careful planning, fewer than half of the replication studies reproduced the original results. While 97 percent of the original studies produced results with a “statistically significant” p-value of 0.05 or less, only 36 percent of the replications did the same. The mean effect size in the replications was less than half that of the originals, and 83 percent of the replicated effects were smaller than the original estimates.
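For intuition about those numbers, here is a minimal simulation sketch (in Python, with scipy) of what happens when original studies are filtered at p < 0.05 and then each is replicated once. The per-group sample size, true effect size, and number of simulated studies are assumptions chosen for illustration, not values from the project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_effect, n_studies = 30, 0.3, 10_000  # assumed: per-group n, true Cohen's d, simulated studies

def run_study():
    """Simulate one two-group study; return observed Cohen's d and two-sided p-value."""
    a = rng.normal(true_effect, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return d, p

originals = [run_study() for _ in range(n_studies)]
published = [(d, p) for d, p in originals if p < 0.05]   # only significant originals get "published"
replications = [run_study() for _ in published]          # one independent replication per published study

orig_d = np.mean([d for d, _ in published])
rep_d = np.mean([d for d, _ in replications])
rep_sig = np.mean([p < 0.05 for _, p in replications])

print(f"mean original effect (selected at p < 0.05): {orig_d:.2f}")
print(f"mean replication effect:                     {rep_d:.2f}")
print(f"share of replications with p < 0.05:         {rep_sig:.0%}")
```

Under these assumptions, the originals that clear the 0.05 bar tend to overstate the effect, so independent replications come in smaller and often miss significance even when the underlying effect is real.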
These replication studies can’t explain why any particular finding was not reproduced, but there are three general possibilities, Nosek said. The originally reported result could have been a false positive, the replication attempt may have produced a false negative (failing to find an effect where one does exist), or the original study and the replication could both be correct but arrive at disparate results because of differences in methodology or conditions that weren’t apparent.
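To make the first two possibilities concrete, here is a rough back-of-the-envelope calculation. The base rate of true hypotheses and the power figures are assumptions for illustration, not estimates from the project.

```python
# Assumed inputs, for illustration only (not estimates from the project).
prior_true = 0.5    # assumed share of tested hypotheses that are actually true
alpha = 0.05        # significance threshold
power_orig = 0.6    # assumed statistical power of the original study
power_rep = 0.8     # assumed statistical power of the replication

# Bayes' rule: P(original was a false positive | original was significant).
p_significant = prior_true * power_orig + (1 - prior_true) * alpha
p_false_positive = (1 - prior_true) * alpha / p_significant

# If the effect is real, the replication misses with probability 1 - power.
p_false_negative = 1 - power_rep

print(f"P(false positive | original significant): {p_false_positive:.2f}")  # ~0.08
print(f"P(false negative | effect is real):       {p_false_negative:.2f}")  # 0.20
```

With these assumed numbers, both routes to a failed replication are plausible, which is why a single non-replication is not, by itself, a verdict on the original study.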
The best predictor of replication success, Nosek told me, was the strength of the original evidence, as measured by factors such as the p-value. Yes, the p-value — that notoriously misleading statistic. This study suggests that p-values can provide useful information, Nosek said. “If it was good for nothing, it wouldn’t have shown any predictive value at all for reproducibility.”
At the same time, this project’s results serve as a stark reminder that the 0.05 threshold for p-values is arbitrary. “What it suggests is that when we get a p-value of 0.04, we should be more skeptical than when we get a lower value,” Nosek said.
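One way to see why the strength of the original evidence predicts replication is to extend the same kind of simulation: mix real and null effects, keep only significant originals, and compare replication rates for originals with very small p-values against those that just cleared 0.05. Again, the effect size, sample size, and mix of true and null hypotheses are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_studies = 30, 20_000  # assumed per-group sample size and number of simulated studies

def study_pvalue(effect):
    """Simulate one two-group study with the given true effect; return its two-sided p-value."""
    a = rng.normal(effect, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    return stats.ttest_ind(a, b).pvalue

# Assumed mix: half the studies test a real effect (d = 0.4), half test a null effect.
effects = rng.choice([0.0, 0.4], size=n_studies)
orig_p = np.array([study_pvalue(e) for e in effects])

# Keep only significant originals, then run one replication of each.
keep = orig_p < 0.05
rep_p = np.array([study_pvalue(e) for e in effects[keep]])
orig_kept = orig_p[keep]

strong = orig_kept < 0.01       # strong original evidence
marginal = orig_kept >= 0.04    # barely under the 0.05 threshold
print(f"replication rate when original p < 0.01:       {np.mean(rep_p[strong] < 0.05):.0%}")
print(f"replication rate when original p is 0.04-0.05: {np.mean(rep_p[marginal] < 0.05):.0%}")
```

In a setup like this, originals with p-values just under 0.05 are disproportionately false positives or lucky draws from underpowered studies, so their replications succeed much less often than replications of originals with p below 0.01, which is consistent with the pattern Nosek describes.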
—-
More at the link.