These doctor-machine partnerships hold particular promise for dermatology, a specialty in which diagnosis often comes down to recognizing the visual characteristics of a disease—something that deep learning systems (DLS) can be trained to do with great accuracy.
There is also hope that machine learning could help address a known problem in the field: only 10 percent of images in dermatology textbooks depict patients with darker skin, meaning doctors may not be familiar with the different ways in which diseases can appear on different skin tones.
New research by Matt Groh, an assistant professor of management and organizations at the Kellogg School, tests this question by looking at how suggestions from deep learning systems affected doctors’ photo-based diagnoses. Groh coauthored the research with dermatologists Omar Badri, Roxana Daneshjou, and Arash Koochek; Caleb Harris, P. Murali Doraiswamy, and Rosalind Picard of the MIT Media Lab; and Luis R. Soenksen of the Wyss Institute for Biologically Inspired Engineering at Harvard.
“So the question was, does a dermatologist with AI assistance do better or not?” explains Groh. The researchers looked not only at overall accuracy levels, but also at fairness—whether accuracy levels increased evenly across images of lighter and darker skin.
The results were mixed. Help from even an imperfect deep learning system increased the diagnostic accuracy of dermatologists and general practitioners by 33 percent and 69 percent, respectively. However, the results also showed that, among GPs, the DLS assistance widened the gap in accuracy between light and dark skin. In other words, the AI-assisted generalists got much better at making correct diagnoses on lighter skin, but only slightly better on darker skin.
For Groh, the results suggest that machine learning in medicine is powerful but not a magic bullet. “AI in healthcare can really help improve things,” he says. “But it matters how we design it: the interface where AI meets people, how the AI performs on different people, and how professionals perform on different people. It’s not just about the artificial intelligence. It’s about us, too.”
Providing AI consultations to doctors
For the study, Groh and his colleagues curated a set of 364 images representing different skin conditions in patients with a variety of skin tones. The researchers made sure to include conditions that look different on light and dark skin—Lyme disease, for example, generally appears as a red or pink rash in people with light skin, but can appear brown, black, purple, or even pale white on people with darker skin.
In the experiment, the researchers used two different deep learning systems. The first, which was trained on images without any intervention from the researchers, had an overall accuracy rate of 47 percent and was designed to mimic the machine learning dermatology tools in development today. The second had been enhanced by the researchers to achieve 84 percent accuracy and was an attempt to simulate the more accurate tools likely to be available to doctors in the future.
The researchers used Sermo, a networking site for health professionals, to recruit more than 1,100 doctors for the study. Participants included dermatologists and dermatology residents, as well as general practitioners and other medical specialists.
Including both specialists and general practitioners was important, Groh says, given that “general practitioners often see skin conditions because it’s hard to book dermatologists. Many times, you might want to talk to a general practitioner before seeing a specialist.”
Participating doctors went to a website where they answered a series of questions about their experience diagnosing skin conditions in patients with different skin tones. They were then presented with ten different photographs of skin conditions and asked to make their top three diagnostic guesses for each, mimicking the differential diagnosis process that doctors use in their real practices. If doctors guessed wrong, they saw a suggested diagnosis from the deep learning system and were given the chance to update or keep their own diagnosis.
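For readers who want that per-image flow spelled out, here is a minimal sketch of the keep-or-update step described above; the function and parameter names are hypothetical illustrations, not the study’s actual code.

# Hypothetical sketch of the flow for one image, based on the description above;
# none of these names come from the study's code.
def run_case(image, doctor_differential, doctor_revise, dls_suggest, truth):
    guesses = doctor_differential(image)        # doctor's top-three ranked guesses
    if truth in guesses:
        return guesses                          # correct: no DLS suggestion is shown
    suggestion = dls_suggest(image)             # incorrect: the DLS suggestion is revealed
    return doctor_revise(guesses, suggestion)   # doctor keeps or updates the differential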
Participants were randomly assigned to receive recommendations from either the less accurate control DLS or the more accurate treatment DLS. (While they had been told at the start of the study that the DLS was not completely accurate, the doctors did not know the overall accuracy rate of either system.)
The researchers also varied how they prompted doctors to update their diagnosis. Half were shown a “keep my differential” button as the first of three options, while the other half saw “update my top prediction” first — a more forceful prompt to accept the AI’s suggestion.
Understanding diagnostic accuracy and fairness
Across all skin conditions and skin tones in the experiment, dermatologists, dermatology residents, general practitioners, and other doctors were 38 percent, 36 percent, 19 percent, and 18 percent accurate, respectively, meaning they included the correct diagnosis among their three guesses. Top-1 accuracy—the accuracy of the diagnosis listed first—was 27 percent, 24 percent, 14 percent, and 13 percent, respectively.
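As a concrete illustration of those two metrics, here is a short sketch of how top-1 and top-3 accuracy can be computed from a doctor’s ranked guesses; the cases and diagnoses below are made up for illustration and are not drawn from the study’s data.

def top_k_accuracy(cases, k):
    # Fraction of cases whose true diagnosis appears among the first k ranked guesses
    hits = sum(1 for guesses, truth in cases if truth in guesses[:k])
    return hits / len(cases)

# Hypothetical cases: (doctor's ranked guesses, ground-truth diagnosis)
cases = [
    (["psoriasis", "eczema", "tinea corporis"], "eczema"),                    # top-3 hit, top-1 miss
    (["Lyme disease", "cellulitis", "drug eruption"], "Lyme disease"),        # top-1 hit
    (["melanoma", "seborrheic keratosis", "nevus"], "basal cell carcinoma"),  # miss
]

print(f"Top-1 accuracy: {top_k_accuracy(cases, 1):.0%}")  # 33%
print(f"Top-3 accuracy: {top_k_accuracy(cases, 3):.0%}")  # 67%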
While these numbers may seem low, Groh says it’s important to remember that the experiment was very limited—much more so than real-world teledermatology would typically be. “It’s a difficult task when you only have one image, no clinical history, no photos in other lighting conditions,” he explains.
Doctors’ diagnostic accuracy dropped even further when the researchers limited their analysis to darker skin. Among generalists, minimal experience with darker-skinned patients had a particularly deleterious effect: primary care providers who reported seeing mostly or all white patients were 7 percentage points less accurate on dark skin than on light skin.
And how did DLS change things?
Even the less accurate control DLS significantly increased accuracy: top-1 accuracy increased by 33 percent among dermatologists and dermatology residents and by 69 percent among general practitioners. Not surprisingly, the more accurate treatment DLS increased these numbers even further.
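Reading those gains as increases relative to each group’s baseline top-1 rate (the natural reading, given the baselines reported above), a quick back-of-the-envelope calculation shows what they would imply; the resulting assisted rates are illustrative arithmetic, not figures reported in the study.

# Illustrative arithmetic only: what 33 percent and 69 percent relative gains
# in top-1 accuracy would mean, starting from the baselines reported above.
baseline = {"dermatologists": 0.27, "general practitioners": 0.14}
relative_gain = {"dermatologists": 0.33, "general practitioners": 0.69}

for group, base in baseline.items():
    assisted = base * (1 + relative_gain[group])
    print(f"{group}: {base:.0%} -> roughly {assisted:.0%} top-1 accuracy")
# dermatologists: 27% -> roughly 36% top-1 accuracy
# general practitioners: 14% -> roughly 24% top-1 accuracy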
“Ultimately, they’re making better decisions while not making as many mistakes,” says Groh — meaning doctors were not simply adopting the DLS’s inaccurate suggestions.
Among dermatologists and dermatology residents, DLS support increased accuracy relatively evenly across all skin tones. However, the same was not true for generalists—their accuracy increased more on light skin tones than on dark ones.
Interestingly, how physicians were asked to adjust their diagnoses in response to DLS feedback had a significant effect on accuracy. When doctors saw “update my top prediction” first on the list, rather than “keep my differential,” their top-1 accuracy increased significantly.
These results point to the critical importance of interface design. “These little details,” says Groh, “can make big differences.”
The right kind of doctor-machine collaboration
Groh says another important takeaway from the research is how difficult it is to diagnose skin conditions from photographs alone. Relatively low overall accuracy rates provide “some sense of how much information is in an image,” he says. “It’s really imperfect.”
The fallibility of images means that the best AI-powered dermatology may look different from how we currently imagine it. Until now, many doctors and computer scientists have assumed that the optimal approach would be to teach the system to produce a single diagnosis from an image. But perhaps, Groh says, it would be more useful to train a DLS to generate lists of possible diagnoses—or even to generate descriptions of the skin condition (such as its size, shape, color, and texture) that could guide doctors’ diagnoses.
Ultimately, the research shows that DLSs and physicians are more powerful together than alone. “It’s all about augmenting people,” says Groh. “We’re not trying to automate anyone away. We’re actually trying to understand how augmentation might work, when and where it will be useful, and how we should design augmentation appropriately.”