Written by: Jayati Dubey
January 6, 2025
Artificial intelligence (AI) has made remarkable strides in healthcare, excelling at tasks such as analyzing X-rays and suggesting treatment plans.
However, its ability to navigate the nuanced dynamics of real-time patient conversations and accurately diagnose conditions through dialogue—a cornerstone of medical practice—remains a significant challenge.
A new study conducted by Harvard Medical School and Stanford University, published in Nature Medicine, sheds light on these limitations while introducing a novel testing framework, the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD).
The study aims to assess how well large language models (LLMs) perform during simulated doctor-patient interactions. This is becoming increasingly important as patients turn to AI systems such as ChatGPT to interpret symptoms and medical results.
Dr Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School and senior author of the study, emphasized the unique challenges posed by medical conversations.
He explained that while AI models often excel at structured tasks such as medical board exams, they falter when required to navigate the dynamic back-and-forth exchanges typical of a doctor’s visit.
Tasks such as asking timely questions, piecing together fragmented information, and reasoning through complex symptoms go beyond the scope of multiple-choice questions and reveal gaps in current AI capabilities.
To evaluate the performance of AI in real-world scenarios, the researchers used the CRAFT-MD framework to test four leading AI models across 2,000 medical cases spanning 12 specialties.
Unlike traditional evaluations that rely on structured multiple-choice formats, this framework simulates open-ended, real-world patient interactions.
The results revealed a significant decline in the accuracy of AI diagnoses in dynamic scenarios. GPT-4, for instance, achieved 82% accuracy when diagnosing pre-prepared case summaries.
However, its accuracy dropped to 63% when required to gather information through dialogue. In even less structured scenarios, such as simulated patient interviews, diagnostic accuracy fell to 26%.
The study highlighted that AI struggled to synthesize information across multiple exchanges during these conversations.
Common issues included missing critical details during patient history-taking, failing to ask appropriate follow-up questions, and an inability to combine different types of data, such as medical images and patient-reported symptoms.
Despite exposing these weaknesses, the CRAFT-MD framework proved to be an efficient evaluation tool, processing 10,000 simulated conversations in 48-72 hours with only 15-16 hours of expert review.
Traditional human-based evaluations, by contrast, would demand extensive recruitment efforts and over 1,100 hours for both simulations and assessments.
Dr Roxana Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University, noted that this efficiency allows researchers to test AI models in scenarios that more closely mirror real-world interactions.
The findings of the study underline the need for significant advancements in AI systems before they can reliably engage in complex medical conversations.
The researchers highlighted the importance of developing models capable of handling unstructured, dynamic conversations and integrating diverse data types, such as clinical measurements, text, and images.
Future systems should also aim to interpret non-verbal cues like facial expressions and tone of voice to more closely replicate human interactions.
The researchers also emphasized the need for thorough testing, combining AI evaluations with human expert assessments to ensure patient safety. These safeguards are essential to avoid exposing patients to unverified AI systems prematurely.
The study underscores that while AI has tremendous potential in healthcare, current systems are not yet equipped to replace human expertise in real-world clinical settings.
These tools are best viewed as supplements to enhance, rather than replace, the work of healthcare professionals.
Dr Daneshjou remarked that CRAFT-MD offers a framework for advancing the field by focusing on real-world testing scenarios that push AI performance toward meaningful healthcare applications.
Stay tuned for more such updates on Digital Health News.