Reputable scientific journals assume that the research they publish will replicate: that is, that an experiment will yield the same results when repeated by someone else. But when a team of researchers tested this assumption in 2015, they found that 60 percent of randomly selected psychology articles from top journals failed to replicate. Similar patterns turned up in economics, biology, and medicine, giving rise to what has become known as the "replication crisis" in science.
How can scientists restore confidence in their findings? Manually repeating every published experiment would be the straightforward solution, but "it's completely out of reach," says Kellogg professor Brian Uzzi. Instead, since 2015, scientists have turned to a technique called "prediction markets," which can forecast replicability with high accuracy. But the process only works on small batches of studies and can take nearly a year to complete.
Uzzi wondered if artificial intelligence could provide a better shortcut.
Recent advances in natural language processing, the ability of computers to analyze the meaning of text, had convinced Uzzi that AI systems had "some superhuman capabilities" that could be applied to the replication crisis. By training one of these systems to read scientific papers, Uzzi, along with his Northwestern University colleagues Yang Yang and Wu Youyou, was able to predict replicability as accurately as prediction markets, but much, much faster.
This boost in efficiency could potentially give journal editors—and even researchers themselves—an early warning system to assess whether a scientific study will be replicated.
"We wanted a self-assessment system," says Uzzi. "We start with the belief that no scientist is trying to publish bad work. A scientist could write a paper and then feed it to the algorithm to see what it thinks. And if it gives you a bad answer, you might have to go back and retrace your steps, because that's a sign that something's wrong."
Hidden Signals
To predict whether a scientific study will replicate without literally repeating the experiment, a reviewer must evaluate the study and look for clues. Traditionally, reviewers did not pay much attention to the wording of a paper. Instead, they inspected the quantitative methods of the experiment itself: the data, models, and sample sizes the experimenter used. If the methods looked sound, it seemed reasonable to assume that the study would replicate. But in practice, this approach "turned out not to be very diagnostic" for weeding out flawed studies, Uzzi says.
Prediction markets keep this basic strategy of evaluating a study's methods but improve its efficiency by asking teams of scientists to review batches of studies at once, in a way that mimics the stock market. For example, 100 reviewers might be asked to make replication predictions for 100 papers by "investing" an imaginary budget in each of them. Some reviewers will "invest" more in certain papers than in others, signaling greater confidence that those papers will replicate. With an accuracy rate between 71 and 85 percent, these prediction markets represent the current state of the art in replication prediction.
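For readers curious about the mechanics, here is a minimal sketch of how such a market's verdicts might be tallied, assuming each reviewer spreads an imaginary budget across the papers. The random allocations and the simple above-median decision rule are illustrative stand-ins, not the actual market mechanism these research teams use.

```python
import numpy as np

rng = np.random.default_rng(0)

n_reviewers, n_papers = 100, 100
budget = 1000.0  # imaginary currency per reviewer

# Hypothetical allocations: each row is one reviewer spreading their
# budget across the papers; more "investment" means more confidence
# that the paper will replicate.
raw = rng.random((n_reviewers, n_papers))
allocations = raw / raw.sum(axis=1, keepdims=True) * budget

# Aggregate the market: a paper's "price" is the share of all invested
# money it attracted. Treat papers priced above the median as predicted
# to replicate (an illustrative decision rule only).
prices = allocations.sum(axis=0) / allocations.sum()
predicted_to_replicate = prices > np.median(prices)

print(f"Papers predicted to replicate: {predicted_to_replicate.sum()} of {n_papers}")
```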
Uzzi and his team, meanwhile, started from a very different intuition about how to predict replicability.
Instead of examining the methods and measurements of scientific studies, they set that information aside and looked at what Uzzi calls the "narrative" of a research paper: the description of the research in prose.
The idea was inspired by a branch of psychology called discourse analysis, which shows that people unknowingly phrase their sentences differently depending on how certain they are of what they are saying. Uzzi suspected that a similar hidden signal might exist in the wording of scientific papers, and that modern machine-learning techniques could detect it.
“Imagine a researcher is writing about how an experiment works, and maybe there’s something they’re worried about that doesn’t make it into their consciousness, but still seeps into their writing,” says Uzzi. “We thought the machine might be able to pick up some of that.”
Trust maps
Computers can't actually read scientific papers, but they can be trained to spot subtle statistical patterns among words.
So Uzzi and his colleagues had the AI system turn two million scientific abstracts into a huge matrix showing how often each word appeared next to every other word. The result was a kind of general, numerical "map" of scientific writing style. Then, to train the system to spot potentially problematic studies, the researchers fed it the full text of 96 psychology papers, 60 percent of which had failed to replicate.
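As a rough illustration of this kind of pipeline (not the team's actual system), the sketch below builds a tiny word co-occurrence "map" with scikit-learn, represents each paper as the average of the map rows for its words, and fits a simple classifier. The corpus, papers, and labels are placeholders; the real system used two million abstracts and 96 labeled papers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1: build a word-association "map" from a (placeholder) corpus of
# abstracts: a matrix counting how often each word appears in the same
# abstract as every other word.
corpus = [
    "we observed a significant effect of priming on response time",
    "results suggest the effect may depend on sample characteristics",
    "the intervention reliably improved recall across both conditions",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)          # documents x words
cooccurrence = (counts.T @ counts).toarray()       # words x words map

# Step 2: represent each labeled paper as the average "map" row of the
# words it uses, then fit a classifier on replication outcomes.
def paper_vector(text):
    tokens = vectorizer.build_analyzer()(text)
    idx = [vectorizer.vocabulary_[w] for w in tokens if w in vectorizer.vocabulary_]
    return cooccurrence[idx].mean(axis=0) if idx else np.zeros(cooccurrence.shape[0])

papers = ["the effect of priming on recall was significant",
          "results may suggest a possible effect under some conditions"]
labels = [1, 0]  # 1 = replicated, 0 = failed (placeholder labels)

X = np.vstack([paper_vector(p) for p in papers])
model = LogisticRegression().fit(X, labels)
```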
These machine-learned word-association maps capture differences in how scientists write when their research replicates and when it doesn't, with a subtlety and precision that human reviewers can't match. "When people read text, by the time you've read the seventh word in a sentence, you've already forgotten the first and second words," says Uzzi. "The machine, on the other hand, has essentially an unlimited consciousness when it comes to ingesting text."
The team then tested the system on hundreds of scientific papers it hadn't encountered before. These papers had all been put through manual replication attempts: some passed, some failed. When the AI analyzed the word associations in these papers, it correctly predicted the replication outcome 65 to 78 percent of the time.
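Conceptually, this is a standard held-out evaluation: train on one set of labeled papers, then score the model only on papers it has never seen. A minimal sketch, using random placeholder features and labels rather than any real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for word-association features and
# replication labels (1 = replicated, 0 = failed to replicate).
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = rng.integers(0, 2, 300)

# Hold out papers the model never sees during training, then measure
# accuracy only on those unseen papers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Accuracy on unseen papers: {model.score(X_test, y_test):.0%}")
```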
This is roughly equivalent to the accuracy of prediction markets, but with one major advantage: for each batch of 100 papers, prediction markets take months to yield results.
“Our AI model makes predictions in hours,” says Uzzi. For a single paper, it takes a few minutes.
An improvement, not a replacement
However, don’t expect computers to replace human peer review any time soon.
Uzzi stresses that his research is very preliminary and needs further validation. Additionally, the AI system has one major drawback: there is no way to tell exactly which pattern the machine is using to make its predictions.
“That’s still one of the downsides of all artificial intelligence: we’re not really sure why it works the way it does,” he says.
However, while Uzzi and his colleagues could not determine exactly what their system was paying attention to, they were able to show that it did not fall prey to biases that often plague human reviewers, such as an author's gender or the prestige of the institutions they are affiliated with.
To rule out these potential biases, the researchers added this extra information to the AI system's training data and repeated their experiments to see whether it skewed the results. Including these extra details did not affect the system's predictions; in practical terms, it ignored them. They also found no evidence that differences in discipline, for example social psychology versus cognitive psychology, affected the predicted outcome.
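One way to run such a check, sketched below with synthetic data and under the assumption that the metadata can be encoded as extra feature columns (this is an illustration, not the team's actual protocol), is to train the model with and without the metadata and measure how often its predictions change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder narrative features and replication labels.
rng = np.random.default_rng(1)
X_text = rng.random((200, 20))
y = rng.integers(0, 2, 200)

# Hypothetical metadata columns: author gender, institutional prestige,
# and subfield, encoded numerically.
metadata = rng.integers(0, 2, (200, 3)).astype(float)

# Train once on narrative features alone and once with metadata appended,
# then check how often the two models disagree on the same papers.
base = LogisticRegression(max_iter=1000).fit(X_text, y)
augmented = LogisticRegression(max_iter=1000).fit(np.hstack([X_text, metadata]), y)

disagreement = np.mean(
    base.predict(X_text) != augmented.predict(np.hstack([X_text, metadata]))
)
print(f"Predictions changed on {disagreement:.0%} of papers after adding metadata")
```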
To Uzzi, this sends a reassuring message about the system's reliability.
“Okay, so we don’t know what the machine is doing—that’s a limitation. At the same time, frankly, it’s easier to de-bias a machine than a human,” he says.
Additionally, Uzzi sees AI as a way to strengthen scientists' ongoing response to the replication crisis. This is especially important in the era of COVID-19, when some peer-review and replication standards have been loosened in an effort to speed the search for a vaccine. An AI-powered early-warning system to flag flawed research could help focus the scientific community's attention on findings important enough to warrant rigorous and expensive manual replication testing.
"We're currently doing a study to see if our model could help review all these new papers on COVID-19 that are coming out," says Uzzi. "This will help us identify the papers that create the strongest foundations for new scientific discoveries in this race to find a vaccine, a cure, or both."