One way scientists measure this is by showing that a particular study is replicable, meaning that if the study is run again using the same methods, it will produce consistent results each time.
However, for a variety of complex reasons, studies often fail to replicate when put to the test. Indeed, a 2016 poll of 1,500 scientists showed that the majority believe science is going through a “reproducibility crisis,” in which many published results cannot be reproduced. Replication failure is estimated to cause over $20 billion in losses across science and industry every year.
“What makes science science is that it replicates,” says Brian Uzzi, professor of management and organizations at the Kellogg School. “It’s not an accident. Scientific results may be important to advancing science or improving people’s lives, and you want to know which results you can rely on.”
A challenge to understanding the extent of this replication problem is that assessments of replicability tend to be done manually, meaning individual studies must be rerun with a new set of participants. As a result, replication assessments cover only a small fraction of published work. “The limitation of manual replication for understanding general patterns of replication failure,” explains Uzzi, “is that it is expensive and does not scale.”
Knowing this, he and his colleagues set out to design a more scalable approach to assessing replicability.
They created an algorithm that could predict the likely replicability of the studies in a paper with a high degree of accuracy. They then applied the algorithm to more than 14,000 papers published in top psychology journals and found that just over 40 percent of the papers were likely to replicate, with certain factors, such as the research method used, boosting the predicted replicability.
“We’ve created a powerful, effective tool to help scientists, funding agencies and the general public understand replicability and have more confidence in certain types of studies,” says Uzzi.
Using artificial intelligence to predict replicability
The group, which included You Wu at University College London and Yang Yang at Notre Dame, first had to create an algorithm that was demonstrably good at predicting replicability.
They did this by training the algorithm to recognize replicating and non-replicating papers. They fed the contents of papers that had already been manually replicated into a neural network, which provided “ground truth” data about which papers had and had not replicated. Once the algorithm was trained to distinguish between replicating and non-replicating papers, its accuracy was verified by testing it on a second set of manually replicated papers that the algorithm had never seen before.
Conventional wisdom in science would suggest that the hard numbers within a paper, such as sample sizes and significance levels, would be best suited to predicting replicability. But that wasn’t the case. “The numerical values were basically unrelated to replication,” says Uzzi.
Instead, they found that training the algorithm on the text of a paper was a more effective way of predicting replicability. “A paper can have 5,000 words and only five significant figures,” says Uzzi. “Much more information about replication could be hidden in the text.”
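To make this kind of text-based workflow concrete, here is a minimal sketch in Python. It is an illustration only, not the team’s actual model: the study trained a neural network on the full text of manually replicated papers, whereas this sketch substitutes a simple TF-IDF representation and logistic regression, and the toy “papers” and replication labels are hypothetical stand-ins.

```python
# Minimal sketch of a text-based replicability classifier, NOT the authors' model.
# The study used a neural network on full paper texts; this stand-in uses TF-IDF
# features and logistic regression, with hypothetical toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for full paper texts and their manual replication outcomes.
papers = [
    "a preregistered experiment with a large representative sample and a direct replication",
    "a small convenience sample of undergraduates completed a novel priming task",
    "survey responses collected across three waves and two countries showed a stable effect",
    "a surprising effect reported in a single underpowered study with marginal significance",
    "the effect was robust across alternative measures and independent labs",
    "an exploratory analysis of many outcomes revealed one significant result",
]
replicated = [1, 0, 1, 0, 1, 0]  # 1 = replicated in a manual test, 0 = did not

# Hold out papers the model never sees during training,
# mirroring the validation step described above.
X_train, X_test, y_train, y_test = train_test_split(
    papers, replicated, test_size=0.33, random_state=0, stratify=replicated
)

# Represent each paper by its wording rather than by its reported statistics.
vectorizer = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Accuracy on the unseen papers.
preds = clf.predict(vectorizer.transform(X_test))
print("held-out accuracy:", accuracy_score(y_test, preds))
```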
To test the algorithm’s strength against the best human-based method, the researchers compared it to the gold standard for predicting a paper’s replicability: a “prediction market.” This method aggregates the predictions of a large number of experts, say 100, who look at a paper and gauge how confident they are that it will replicate in a subsequent manual replication test. “Prediction markets work well because they tap into the wisdom of crowds, but they’re expensive to run on more than a few papers,” says Uzzi.
The researchers then used their algorithm to predict the replicability of psychology studies that had previously been manually replicated in laboratories. The algorithm performed on par with the prediction market, with about 75 percent accuracy, “at a fraction of the cost,” Uzzi says.
Examining 20 years of leading psychology papers
Having established the algorithm’s validity, the team then applied it to two decades of psychology research to see which factors correlated with the likelihood of replication.
“We looked at basically every paper that has been published in the top six psychology journals in the last 20 years,” says Uzzi. This included 14,126 psychology-research articles from more than 6,000 institutions, from six subfields of psychology: developmental, personality, organizational, cognitive, social, and clinical.
The researchers looked at potential predictors of replicability, including the research method—such as an experiment versus a survey—the subfield of psychology the research represented, and the amount of media coverage the findings received after publication.
Running the papers through the algorithm yielded a replication score for each paper, representing the probability that it would succeed in a manual replication.
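Continuing the earlier hypothetical sketch (and reusing its clf and vectorizer objects), such a score would simply be the model’s predicted probability of the “replicated” class for each new paper:

```python
# Continuing the hypothetical sketch above: score unlabeled papers with the
# trained stand-in classifier. The predicted probability of the "replicated"
# class plays the role of a paper's replication score.
new_papers = [
    "a field experiment on financial incentives in a large organization",
    "a counterintuitive priming effect reported with marginal significance",
]
scores = clf.predict_proba(vectorizer.transform(new_papers))[:, 1]
for text, score in zip(new_papers, scores):
    print(f"{score:.2f}  {text[:60]}")
print("average replication score:", scores.mean())
```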
The challenge of replication
Like previous manual replication efforts, their algorithm predicted an overall low level of replicability across studies: an average replication score of 0.42, representing about a 42 percent chance of successfully replicating.
The researchers then looked at which factors seemed to influence a paper’s predicted replicability. Some didn’t seem to matter at all, including whether the paper’s first author was a junior or senior scholar, or whether they worked at an elite institution. Replicability “has little to do with prestige,” says Uzzi.
So what does help explain replicability?
One predictor is the type of study. Experiments had about a 39 percent chance of replicating, while other types of studies had about a 50 percent chance. This is counterintuitive because, as Uzzi notes, an experiment is “the best method we have for verifying causal relationships,” thanks to the random assignment of participants to conditions and other controls.
However, experiments have their own challenges. For example, many scientific journals only publish experiments that reach a certain statistical threshold. This has led to the “file drawer” problem, where researchers who need to publish studies to advance their careers will (whether consciously or subconsciously) submit only those studies on a given psychological phenomenon that happen to exceed this statistical threshold, remaining mum about studies of the same phenomenon that don’t. Over time, this can lead to a scientific literature that greatly overestimates the strength of many psychological effects.
In addition, the participant pool may be a factor, as many psychology experiments are conducted on fairly homogeneous groups of undergraduate students. “The results may not generalize if they are specific to this subpopulation of undergraduates,” says Uzzi. For example, “there may be something unique about Harvard undergraduates that isn’t replicated with students from the University of North Carolina or somewhere else.”
The algorithm can detect even subtler predictors hidden in the text of a study, ones that even the authors were not consciously aware of. “When you ask scientists to think retrospectively about why their study might or might not be replicated, they sometimes say things like, ‘The day we ran the experiment it was raining and the participants showed up wet, but we didn’t think that affected the experiment,’” says Uzzi. “But when they wrote the study, they may have unconsciously considered the effects of rain, which were expressed as an unappreciated change in the paper’s semantics that the AI can detect.”
Some subfields of psychology fared better than others in terms of replicability. For example, personality psychology received an overall 55 percent probability of replicating, while organizational psychology came in at 50 percent.
“Organizational psychology is interested in pragmatic results rather than theoretical ones,” says Uzzi. “So their tests and experiments can include things that are more likely to be replicated because you’ll see them often in a real organization, such as how people respond to financial rewards.”
Finally, among published studies, the strongest correlate of replication failure was the degree of media attention a study received. While this is problematic, since media attention implies that a finding is important, the result makes sense. “What really gets the media talking is a surprising finding or even a controversial one,” says Uzzi. “These are the kinds of things that might be less likely to happen again.”
Replication and the real world
The findings clearly have implications for researchers and wider society.
“We offer a low-cost, simpler way to assess replicability that is about as accurate as the best human-based method,” says Uzzi. “Researchers and those who review or use research can use the data from our tool, along with their own intuition about a paper, to understand the strength of the research.”
For example, he says, “if the government wanted to reduce suicides in the military, a big problem right now, with a new treatment approach, it could test the study underlying that treatment ahead of a manual replication to help plan the treatment’s implementation.”
Scientists and funders could also use the tool to design studies that are more likely to be replicable in the first place. “It could be used for self-diagnosis,” says Uzzi. “Before someone submits a paper for peer review, they could run it through our algorithm and get a score. It can give them a chance to stop, rethink their approach and possibly identify things that need to be fixed.”
Collaborators are working on a website that would allow researchers to do just that. The tool “can be particularly useful for one-shot studies, such as those that take place over 10 years or that include subpopulations that are difficult to access,” and are therefore not possible to replicate by hand, Uzzi says.
While the researchers focused on psychology, the study’s findings have implications for other fields as well. “Replication is also a serious problem in fields such as medicine and economics,” he says. “We can think of psychology as a use case to help researchers in those fields develop similar tools to assess replicability.”