Written by: Jayati Dubey
July 26, 2024
Researchers at the National Institutes of Health (NIH) have found that an artificial intelligence (AI) model can solve medical quiz questions with high accuracy.
These quizzes are designed to test health professionals' diagnostic skills using clinical images and brief text summaries. Despite the model's accuracy, physician graders identified errors in its image descriptions and in its explanations of its decision-making process.
The findings, which highlight AI's potential and limitations in clinical settings, were published in npj Digital Medicine. Researchers from NIH's National Library of Medicine (NLM) and Weill Cornell Medicine in New York City led the study.
"Integration of AI into healthcare holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner," said NLM acting director Stephen Sherry, PhD.
Sherry added that, according to the study, AI is not yet advanced enough to replace human experience, which remains crucial for accurate diagnosis.
The AI model and human physicians answered questions from the New England Journal of Medicine (NEJM) Image Challenge, an online quiz that presents real clinical images along with a short text description of the patient's symptoms, followed by multiple-choice diagnosis options.
Researchers tasked the AI model with answering 207 Image Challenge questions and providing written rationales for its answers. Each rationale was to include a description of the image, a summary of relevant medical knowledge, and step-by-step reasoning.
Nine physicians from various institutions and specialties participated, answering their assigned questions in both a "closed-book" setting (without access to external resources) and an "open-book" setting (with external resources).
They were then given the correct answers, along with the AI model's answers and rationales, and asked to score the AI's image descriptions, medical knowledge summaries, and reasoning.
The AI model and physicians both scored highly in selecting the correct diagnosis.
Interestingly, the AI model outperformed physicians in closed-book settings, while physicians using open-book tools performed better than the AI, especially on the most difficult questions.
Despite its overall accuracy, the AI model often made mistakes in describing medical images and explaining its reasoning, even when arriving at the correct diagnosis.
For example, when shown a photo of a patient's arm with two lesions, a physician would recognize that the same condition caused both lesions.
The AI model, however, failed to make that connection, misled by the different angles at which the lesions appeared and the resulting illusion of different colors and shapes.
The researchers argued that these findings underline the importance of further evaluating multimodal AI technology before introducing it into the clinical setting.
"This technology has the potential to help clinicians augment their capabilities with data-driven insights that may lead to improved clinical decision-making. Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine." said NLM senior investigator and corresponding author of the study, Zhiyong Lu.
The study used an AI model known as GPT-4V (Generative Pre-trained Transformer 4 with Vision), a multimodal AI model capable of processing both text and images.
While the study was small, it sheds light on the potential for multimodal AI to assist in medical decision-making. More research is needed to compare such models to physicians' diagnostic abilities.