“AI that better imitates humans generally seems like a good thing,” says Blake McShane, a professor of marketing at Kellogg. “But when AI mimics human mistakes, that is obviously bad when accuracy is the goal.”
People tend to see the world in discrete categories rather than on a continuum. This black-and-white thinking extends to science: researchers, for example, apply arbitrary thresholds to their results, an approach that can lead to errors of interpretation.
In a new study, McShane and two colleagues, David Gal and Adam Duhachek of the University of Illinois Chicago, found that AI models fall victim to these errors just as human researchers do.
“Since AI models ‘learn’ from human-written text, and people make these mistakes all the time, there is a danger that AI models will do the same,” says McShane.
“Statistical significance” in scientific practice
Researchers have long relied on statistical tests to interpret the results of a study. One of the most popular, the null hypothesis significance test, produces a measure known as a p-value, a number between zero and one. Conventionally, researchers deem their results “statistically significant” when the p-value falls below 0.05 and “statistically nonsignificant” when it falls above it.
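To make the convention concrete, here is a minimal Python sketch with simulated data; the effect size and sample sizes are arbitrary assumptions for illustration, not figures from the study. It runs a two-sample t-test and then dichotomizes the p-value at 0.05:

```python
# Minimal illustration of the 0.05 convention, using simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(loc=0.4, scale=1.0, size=50)  # arbitrary true effect
control = rng.normal(loc=0.0, scale=1.0, size=50)    # no effect

p = stats.ttest_ind(treatment, control).pvalue
print(f"p = {p:.3f}")
# The conventional (and, as the article argues, problematic) dichotomization:
print("statistically significant" if p < 0.05 else "statistically nonsignificant")
```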
A cognitive error often accompanies this dichotomy: researchers mistakenly interpret “statistical significance” as proof that the effect they are studying exists and “statistical nonsignificance” as proof that there is no effect.
Compounding matters, the 0.05 threshold has become a kind of gatekeeper for research publication. Studies that report “statistically significant” results are far more likely to be published than those that do not, even when their p-values are nearly identical. This produces a biased literature. It also encourages harmful research practices that push the p-value to the desired side of the threshold.
Because the p-value is a continuous measure of evidence, says McShane, a p-value just above the 0.05 threshold is essentially identical to one just below it. But it is even worse than that, he says. In addition to being continuous, p-values naturally vary a great deal from study to study. Thus an initial study with a p-value of 0.005 and a replication with a p-value of 0.19 are entirely compatible with one another, even though the first p-value falls well below the 0.05 threshold and the second well above it.
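A quick simulation illustrates this study-to-study variability. The sketch below assumes a fixed, modest true effect and 50 subjects per group (numbers chosen for illustration, not drawn from the paper) and reruns the “same” experiment 1,000 times; the resulting p-values land on both sides of 0.05 even though the underlying effect never changes:

```python
# Replicate the "same" experiment many times and watch the p-value bounce around.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
pvals = []
for _ in range(1000):
    treatment = rng.normal(loc=0.4, scale=1.0, size=50)  # fixed true effect
    control = rng.normal(loc=0.0, scale=1.0, size=50)
    pvals.append(stats.ttest_ind(treatment, control).pvalue)

pvals = np.array(pvals)
print(f"5th-95th percentile of p-values: {np.percentile(pvals, 5):.4f} to {np.percentile(pvals, 95):.2f}")
print(f"share below 0.05: {(pvals < 0.05).mean():.0%}")
```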
Yet in previous work with Gal, McShane found that most researchers cling blindly to the arbitrary 0.05 “statistical significance” threshold, treating results as black and white rather than as continuous.
Like human, like AI
McShane and his colleagues investigated whether AI models (ChatGPT, Gemini, and Claude) adhere to the 0.05 “statistical significance” threshold when interpreting statistical results, just as humans do. They asked the AI models to interpret the results of three different hypothetical experiments.
The first concerned survival rates among patients with terminal cancer. Patients in this hypothetical experiment were assigned to one of two groups: Group A, where they wrote daily about the positive things they were blessed with, or Group B, where they wrote daily about the misfortunes of others. The results were that, on average, patients in Group A lived for 8.2 months after their initial diagnosis, compared with 7.5 months for patients in Group B.
After presenting this information to the AI models, the researchers asked which of four options provided the most accurate summary of the results: that, on average, patients in Group A lived longer post-diagnosis than those in Group B; that patients in Group B lived longer; that the two groups lived equally long; or that it cannot be determined which group lived longer. The researchers posed this question to each AI model repeatedly, varying only the p-value comparing the two groups from a “statistically significant” 0.049 to a trivially different but “statistically nonsignificant” 0.051.
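In outline, the manipulation is straightforward to reconstruct. The sketch below is an illustration of the setup, not the authors’ actual materials: the prompt wording and answer options are paraphrased from the description above, and the ask_model helper is a hypothetical stand-in for whatever API call would send the prompt to ChatGPT, Gemini, or Claude. The two prompts differ only in the p-value:

```python
# Reconstructed sketch of the prompt manipulation; wording is illustrative only.
PROMPT_TEMPLATE = """Patients in Group A lived an average of 8.2 months post-diagnosis,
versus 7.5 months in Group B (p = {p}). Which is the most accurate summary?
(a) Group A lived longer on average.
(b) Group B lived longer on average.
(c) The groups lived equally long on average.
(d) It cannot be determined which group lived longer."""

for p in (0.049, 0.051):  # trivially different, but on opposite sides of 0.05
    prompt = PROMPT_TEMPLATE.format(p=p)
    # reply = ask_model(prompt)  # hypothetical call to ChatGPT, Gemini, or Claude
    print(prompt, end="\n\n")
```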
There was a clear split in how the AI models responded depending on the p-value: they almost always answered that Group A lived longer when the p-value was 0.049 (“statistically significant”) but did so far less often when it was 0.051 (“statistically nonsignificant”).
“The answers differed when the 0.05 threshold was crossed,” says McShane. “A small change in the input resulted in a big change in the output.”
The researchers found the same pattern in the two other hypothetical experiments. For example, in one about drug efficacy, where the results for drug A were more promising than those for drug B, they asked the AI models whether a patient would be more likely to recover if given drug A or drug B. The models overwhelmingly chose drug A when the p-value was 0.049 but only rarely when it was 0.051.
Across all of these experiments, the results closely mirrored what happened when academic researchers answered the same questions in earlier studies. Where the p-value fell relative to the 0.05 “statistical significance” threshold played a key role in shaping how both the AI models and the human researchers interpreted the results.
The AI models even invoked “statistical significance” in the absence of a p-value. “We ran some tests where we didn’t give a p-value at all, yet the answers would still invoke ‘statistical significance,’” says McShane.
A warning
The researchers extended the study by appending to the AI prompts an explicit directive from the American Statistical Association warning against relying on p-value thresholds when interpreting quantitative results. Despite this guidance, the AI models continued to respond dichotomously, answering one way when the p-value was 0.049 and another way when it was 0.051.
Even the strongest and most recent models were susceptible. OpenAI, for example, released a new version of its ChatGPT model while McShane and his colleagues were conducting this project, one designed to break problems down into smaller pieces and reason its way through answers. This updated AI model responded even more dichotomously than the older models.
“I can’t say for sure why this is, but if I had to guess, it may be because these newer and larger models mimic human responses more effectively,” says McShane. “If that is the case, then the closer these AI models get to producing human-like text, the more their answers may fall into the traps humans fall into, whether around ‘statistical significance’ as in our research or possibly more broadly.”
For McShane, these results raise red flags as people in academia and other industries give AI more autonomy over more dimensions of their work. He notes that researchers already use AI to summarize papers, conduct literature reviews, perform statistical analyses, and even pursue new scientific discoveries. And yet every model he and his colleagues examined showed a systematic inability to correctly interpret basic statistical results, a seemingly necessary prerequisite, says McShane, for all of those other undertakings.
“People are asking AI models to do things that are far more complicated than the basic multiple-choice questions we asked,” he says, “but if they perform so erratically on our questions, it casts doubt on their ability to handle much more ambitious tasks.”
