In a new study, scientists at Beth Israel Deaconess Medical Center (BIDMC) compared the clinical reasoning capabilities of a large language model with those of physicians. The researchers used the revised IDEA (r-IDEA) score, a commonly used tool for assessing clinical reasoning.
The study presented the same 20 clinical cases to a GPT-4-powered chatbot, 21 attending physicians, and 18 resident physicians, asking each to work through the diagnostic reasoning. All three response sets were then assessed using the r-IDEA score. The researchers found that the chatbot earned the highest r-IDEA scores, an impressive showing in terms of diagnostic reasoning. However, the authors also noted that the chatbot was “just wrong” more often.
Stephanie Cabral, MD, the study’s lead author, explained that “further studies are needed to determine how LLMs can best be incorporated into clinical practice, but even now, they could be useful as a checkpoint, helping us to make sure that we don’t miss anything.” In summary, the results showed good reasoning by the chatbot, but also significant errors. This further reinforces the idea that these AI systems, at least at their current level of maturity, are better suited as tools to augment a physician’s practice than as replacements for a physician’s diagnostic abilities.
As physician leaders and technologists alike often explain, this is because the practice of medicine is not based solely on the algorithmic application of rules; rather, it relies on a deep sense of reasoning and clinical intuition that is difficult to replicate with an LLM. Even so, tools that provide diagnostic or clinical support can be a powerful addition to a physician’s workflow. For example, if a system can reasonably provide a “first pass” diagnostic suggestion based on available data, such as patient history or existing records, physicians can save significant time in the diagnostic process. Likewise, if these tools can streamline a physician’s workflow and improve the processing of large volumes of clinical information from the medical record, there may be real opportunities to increase efficiency.
Many organizations are already pursuing these opportunities for clinical augmentation. For example, AI-powered writing tools leverage natural language processing to help physicians complete clinical documentation more efficiently. Enterprise search tools integrate across organizations and with EMR systems to help physicians search large swathes of data, promote data interoperability, and gather faster and deeper insights from existing patient records. Other systems go further and offer an initial diagnosis: tools are emerging in radiology and dermatology, for instance, that can suggest a possible diagnosis by analyzing an uploaded image.
However, there is still much work to be done in this area. Simply put, even though AI systems like these are not yet ready to handle clinical diagnostics on their own, there remains a real opportunity to leverage the technology to augment clinical workflows, especially with a human kept in the loop to ensure safe, secure, and effective care.