Commonly used large language models (LLMs) have demonstrated the ability to provide appropriate, guideline-aligned treatment recommendations for patients with straightforward cases of early-stage hepatocellular carcinoma. However, a study published in PLOS Medicine reveals that these models show greater disagreement with physician recommendations in cases of late-stage disease. The findings come from a Korean retrospective registry study that sheds light on the potential and limitations of AI in medical decision-making.
The findings underscore the importance of using LLMs as a complement to, rather than a replacement for, clinical expertise. “Our study shows that [LLMs] can help support treatment decisions for early-stage liver cancer, but their performance is more limited in advanced disease,” said Ji Won Han, MD, PhD, from the Division of Gastroenterology and Hepatology, Department of Internal Medicine, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
Study Methods and Analysis
The research team explored the clinical relevance of treatment recommendations generated by several LLMs, including ChatGPT 4o, Gemini 2.0, and Claude 3.5. These recommendations were compared with physicians’ real-world decisions and the resulting patient outcomes. The study retrospectively analyzed data from the Korean Primary Liver Cancer Registry, covering 13,614 patients diagnosed with treatment-naive hepatocellular carcinoma between 2008 and 2020.
Recommendations were tested using standardized prompts that referenced guidelines from the American Association for the Study of Liver Diseases and the European Association for the Study of the Liver. Patients were classified based on whether the LLM recommendations matched the actual treatments administered by physicians. Decision trees were employed to identify factors influencing treatment choices.
Key Findings and Implications
Among the LLMs, Gemini 2.0 achieved the highest concordance rate with physician decisions at 32.7%, followed by ChatGPT 4o at 31.1% and Claude 3.5 at 26.8%. The study revealed that patients with early-stage (BCLC-A) disease experienced better survival outcomes when LLM recommendations aligned with physician decisions. In contrast, for patients with advanced-stage (BCLC-C) disease, concordance between AI recommendations and physician treatments was associated with worse survival outcomes.
For ChatGPT 4o, the stage-specific survival associations were:

- BCLC-A: hazard ratio (HR) = 0.743; 95% confidence interval (CI) = 0.665–0.831; P < .001
- BCLC-C: HR = 1.650; 95% CI = 1.523–1.787; P < .001
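To make the concordance classification concrete, the following is a minimal sketch of how patients might be grouped by whether an LLM's recommendation matched the treatment actually administered, stratified by BCLC stage. The records, field names, and treatment labels here are hypothetical illustrations, not data from the study.

```python
# Hypothetical patient records; field names and values are illustrative only.
records = [
    {"stage": "BCLC-A", "llm_rec": "resection", "physician_tx": "resection"},
    {"stage": "BCLC-A", "llm_rec": "ablation",  "physician_tx": "resection"},
    {"stage": "BCLC-C", "llm_rec": "systemic",  "physician_tx": "locoregional"},
    {"stage": "BCLC-C", "llm_rec": "systemic",  "physician_tx": "systemic"},
]

def concordance_rate(rows):
    """Fraction of patients whose LLM recommendation matched the actual treatment."""
    matches = sum(1 for r in rows if r["llm_rec"] == r["physician_tx"])
    return matches / len(rows)

# Stratify by stage, mirroring the BCLC-A vs. BCLC-C comparison in the study.
by_stage = {}
for stage in ("BCLC-A", "BCLC-C"):
    subset = [r for r in records if r["stage"] == stage]
    by_stage[stage] = concordance_rate(subset)

print(by_stage)  # → {'BCLC-A': 0.5, 'BCLC-C': 0.5}
```

In the study itself, survival within each concordance group was then compared with Cox regression, yielding the stage-specific hazard ratios reported above (HR < 1 indicating better survival with concordance, HR > 1 worse).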
Analysis of the factors driving these decisions showed that physicians tended to prioritize liver function parameters, while LLMs focused more on tumor characteristics. Physicians generally avoided curative treatments when hepatic reserve was limited in early-stage cases, and in advanced-stage disease they opted for locoregional therapies more often, even when these choices diverged from guideline recommendations for systemic therapy.
Expert Opinions and Future Directions
The study authors suggest that while LLMs may serve as adjunctive tools for guideline-concordant decisions in straightforward scenarios, their recommendations may reflect limited contextual awareness in complex clinical situations requiring individualized care. “LLM recommendations should be interpreted cautiously alongside clinical judgment,” they advised.
As the study was limited by its retrospective design, lack of imaging information, and focus on guideline-era treatments, the authors recommended prospective validation of these findings. This development follows ongoing discussions in the medical community about the role of AI in healthcare, highlighting the need for careful integration of technology with human expertise.
Looking ahead, further research is needed to refine AI models and enhance their ability to handle complex clinical scenarios. As technology continues to evolve, the collaboration between AI and healthcare professionals will be crucial in improving patient outcomes and advancing medical practice.