Comparing the accuracy of large language models and prompt engineering in diagnosing real-world cases

Bibliographic Details
Main Authors: Yao, Guanhong (Author), Zhang, WuJi (Author), Zhu, Yingxi (Author), Wong, Ut-kei (Author), Zhang, Yanfeng (Author), Yang, Cui (Author), Shen, Guanghao (Author), Li, Zhanguo (Author), Gao, Hui (Author)
Format: Article (Journal)
Language: English
Published: November 2025
In: International journal of medical informatics
Year: 2025, Volume: 203, Pages: 1-8
ISSN: 1872-8243
DOI: 10.1016/j.ijmedinf.2025.106026
Online Access: Publisher, free of charge, full text: https://doi.org/10.1016/j.ijmedinf.2025.106026
Publisher, free of charge, full text: https://www.sciencedirect.com/science/article/pii/S1386505625002436
Author Notes: Guanhong Yao, WuJi Zhang, Yingxi Zhu, Ut-kei Wong, Yanfeng Zhang, Cui Yang, Guanghao Shen, Zhanguo Li, Hui Gao
Description
Summary:
Importance: Large language models (LLMs) hold potential in clinical decision-making, especially for complex and rare disease diagnoses. However, real-world applications require further evaluation for accuracy and utility.
Objective: To evaluate the diagnostic performance of four LLMs (GPT-4o mini, GPT-4o, ERNIE, and Llama-3) using real-world inpatient medical records and to assess the impact of different prompt engineering methods.
Method: This single-center, retrospective study was conducted at Peking University International Hospital. It involved 1,122 medical records categorized into common rheumatic autoimmune diseases, rare rheumatic autoimmune diseases, and non-rheumatic diseases. Four LLMs were evaluated using two prompt engineering methods: few-shot and chain-of-thought prompting. Diagnostic accuracy (hit1) was defined as the inclusion of the first final diagnosis from the medical record in the model’s top prediction.
Results: Hit1 for the four LLMs was as follows: GPT-4o mini (81.8 %), GPT-4o (82.4 %), ERNIE (82.9 %), and Llama-3 (82.7 %). Few-shot prompting significantly improved GPT-4o’s hit1 (85.9 %) compared to its base model (p = 0.02), outperforming the other models (all p < 0.05). Chain-of-thought prompting showed no significant improvement. Hit1 for both common and rare rheumatic diseases was consistently higher than that for non-rheumatic diseases. Few-shot prompting increased the cost per correct diagnosis for GPT-4o by approximately ¥4.54.
Conclusions: LLMs, including GPT-4o, demonstrate promising diagnostic accuracy on real medical records. Few-shot prompting enhances performance but at higher cost, underscoring the need for accuracy improvements and cost management. These findings inform LLM development in Chinese medical contexts and highlight the necessity for further multi-center validation.
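The hit1 metric and the cost per correct diagnosis mentioned in the summary can be illustrated with a minimal sketch. The record does not describe how the authors matched diagnoses or computed costs, so everything here is an assumption made for illustration: the function names, the case-insensitive substring rule used to decide "inclusion", the cost formula (total spend divided by number of correct diagnoses), and the four sample records, which are invented and not taken from the study.

```python
def is_hit1(first_final_diagnosis: str, top_prediction: str) -> bool:
    """Return True if the top prediction includes the first final diagnosis.

    Assumption: "inclusion" is approximated by a case-insensitive
    substring test; the study's actual matching rule is not stated.
    """
    return first_final_diagnosis.strip().lower() in top_prediction.strip().lower()


def hit1_rate(records: list[tuple[str, str]]) -> float:
    """Fraction of (first_final_diagnosis, top_prediction) pairs that are hits."""
    if not records:
        raise ValueError("at least one record is required")
    hits = sum(is_hit1(diagnosis, prediction) for diagnosis, prediction in records)
    return hits / len(records)


def cost_per_correct(total_cost: float, n_correct: int) -> float:
    """Assumed definition: total spend divided by the number of correct diagnoses."""
    if n_correct <= 0:
        raise ValueError("n_correct must be positive")
    return total_cost / n_correct


# Hypothetical records: (first final diagnosis, model's top prediction).
records = [
    ("systemic lupus erythematosus", "Systemic lupus erythematosus with lupus nephritis"),
    ("gout", "Rheumatoid arthritis"),
    ("Rheumatoid arthritis", "rheumatoid arthritis"),
    ("Behcet's disease", "Ankylosing spondylitis"),
]

print(f"hit1 = {hit1_rate(records):.1%}")  # 2 of 4 hypothetical records match: 50.0%
```

A substring rule is the simplest possible reading of "inclusion"; a real evaluation would more likely rely on clinician review or a synonym-aware mapping, since the same disease can be named in several ways.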
Item Description: Available online: 25 June 2025, article version: 4 July 2025
Viewed on 25 September 2025
Physical Description: Online Resource