Comparing the accuracy of large language models and prompt engineering in diagnosing realworld cases

Importance - Large language models (LLMs) hold potential in clinical decision-making, especially for complex and rare disease diagnoses. However, real-world applications require further evaluation for accuracy and utility. - Objective - To evaluate the diagnostic performance of four LLMs (GPT-4o min...

Bibliographic details
Main authors: Yao, Guanhong (author), Zhang, WuJi (author), Zhu, Yingxi (author), Wong, Ut-kei (author), Zhang, Yanfeng (author), Yang, Cui (author), Shen, Guanghao (author), Li, Zhanguo (author), Gao, Hui (author)
Document type: Article (Journal)
Language: English
Published: November 2025
In: International journal of medical informatics
Year: 2025, Volume: 203, Pages: 1-8
ISSN:1872-8243
DOI:10.1016/j.ijmedinf.2025.106026
Online access: Publisher, free of charge, full text: https://doi.org/10.1016/j.ijmedinf.2025.106026
Publisher, free of charge, full text: https://www.sciencedirect.com/science/article/pii/S1386505625002436
Statement of responsibility: Guanhong Yao, WuJi Zhang, Yingxi Zhu, Ut-kei Wong, Yanfeng Zhang, Cui Yang, Guanghao Shen, Zhanguo Li, Hui Gao

MARC

LEADER 00000caa a2200000 c 4500
001 1936787318
003 DE-627
005 20251120111729.0
007 cr uuu---uuuuu
008 250925s2025 xx |||||o 00| ||eng c
024 7 |a 10.1016/j.ijmedinf.2025.106026  |2 doi 
035 |a (DE-627)1936787318 
035 |a (DE-599)KXP1936787318 
040 |a DE-627  |b ger  |c DE-627  |e rda 
041 |a eng 
084 |a 33  |2 sdnb 
100 1 |a Yao, Guanhong  |e VerfasserIn  |0 (DE-588)137738036X  |0 (DE-627)1936787695  |4 aut 
245 1 0 |a Comparing the accuracy of large language models and prompt engineering in diagnosing realworld cases  |c Guanhong Yao, WuJi Zhang, Yingxi Zhu, Ut-kei Wong, Yanfeng Zhang, Cui Yang, Guanghao Shen, Zhanguo Li, Hui Gao 
264 1 |c November 2025 
300 |b Illustrationen 
300 |a 8 
336 |a Text  |b txt  |2 rdacontent 
337 |a Computermedien  |b c  |2 rdamedia 
338 |a Online-Ressource  |b cr  |2 rdacarrier 
500 |a Online verfügbar: 25. Juni 2025, Artikelversion: 4. Juli 2025 
500 |a Gesehen am 25.09.2025 
520 |a Importance - Large language models (LLMs) hold potential in clinical decision-making, especially for complex and rare disease diagnoses. However, real-world applications require further evaluation for accuracy and utility. - Objective - To evaluate the diagnostic performance of four LLMs (GPT-4o mini, GPT-4o, ERNIE, and Llama-3) using real-world inpatient medical records and assess the impact of different prompt engineering methods. - Method - This single-center, retrospective study was conducted at Peking University International Hospital. It involved 1,122 medical records categorized into common rheumatic autoimmune diseases, rare rheumatic autoimmune diseases, and non-rheumatic diseases. Four LLMs were evaluated using two prompt engineering methods: few-shot and chain-of-thought prompting. Diagnostic accuracy (hit1) was defined as the inclusion of the first final diagnosis from the medical record in the model’s top prediction. - Results - Hit1 for the four LLMs was as follows: GPT-4o mini (81.8%), GPT-4o (82.4%), ERNIE (82.9%), and Llama-3 (82.7%). Few-shot prompting significantly improved GPT-4o’s hit1 (85.9%) compared to its base model (p = 0.02), outperforming other models (all p < 0.05). Chain-of-thought prompting showed no significant improvement. Hit1 for both common and rare rheumatic diseases was consistently higher than that for non-rheumatic diseases. Few-shot prompting increased the cost per correct diagnosis for GPT-4o by approximately ¥4.54. - Conclusions - LLMs, including GPT-4o, demonstrate promising diagnostic accuracy on real medical records. Few-shot prompting enhances performance but at higher costs, underscoring the need for accuracy improvements and cost management. These findings inform LLM development in Chinese medical contexts and highlight the necessity for further multi-center validation.
650 4 |a Accuracy 
650 4 |a Large Language Models, LLMs 
650 4 |a Prompt Engineering 
650 4 |a Realworld 
650 4 |a Rheumatic Disease 
700 1 |a Zhang, WuJi  |e VerfasserIn  |4 aut 
700 1 |a Zhu, Yingxi  |e VerfasserIn  |4 aut 
700 1 |a Wong, Ut-kei  |e VerfasserIn  |4 aut 
700 1 |a Zhang, Yanfeng  |e VerfasserIn  |4 aut 
700 1 |a Yang, Cui  |d 1984-  |e VerfasserIn  |0 (DE-588)1136151982  |0 (DE-627)891949968  |0 (DE-576)490363180  |4 aut 
700 1 |a Shen, Guanghao  |e VerfasserIn  |4 aut 
700 1 |a Li, Zhanguo  |e VerfasserIn  |4 aut 
700 1 |a Gao, Hui  |e VerfasserIn  |4 aut 
773 0 8 |i Enthalten in  |t International journal of medical informatics  |d Amsterdam [u.a.] : Elsevier, 1997  |g 203(2025) vom: Nov., Artikel-ID 106026, Seite 1-8  |h Online-Ressource  |w (DE-627)265783720  |w (DE-600)1466296-6  |w (DE-576)074890913  |x 1872-8243  |7 nnas  |a Comparing the accuracy of large language models and prompt engineering in diagnosing realworld cases 
773 1 8 |g volume:203  |g year:2025  |g month:11  |g elocationid:106026  |g pages:1-8  |g extent:8  |a Comparing the accuracy of large language models and prompt engineering in diagnosing realworld cases 
856 4 0 |u https://doi.org/10.1016/j.ijmedinf.2025.106026  |x Verlag  |x Resolving-System  |z kostenfrei  |3 Volltext 
856 4 0 |u https://www.sciencedirect.com/science/article/pii/S1386505625002436  |x Verlag  |z kostenfrei  |3 Volltext 
951 |a AR 
992 |a 20250925 
993 |a Article 
994 |a 2025 
998 |g 1136151982  |a Yang, Cui  |m 1136151982:Yang, Cui  |d 60000  |d 61800  |e 60000PY1136151982  |e 61800PY1136151982  |k 0/60000/  |k 1/60000/61800/  |p 6 
999 |a KXP-PPN1936787318  |e 4775902512 
BIB |a Y 
SER |a journal 
JSO |a {"relHost":[{"origin":[{"dateIssuedDisp":"1997-","publisher":"Elsevier","dateIssuedKey":"1997","publisherPlace":"Amsterdam [u.a.]"}],"id":{"issn":["1872-8243"],"zdb":["1466296-6"],"eki":["265783720"]},"physDesc":[{"extent":"Online-Ressource"}],"title":[{"title_sort":"International journal of medical informatics","title":"International journal of medical informatics"}],"note":["Gesehen am 05.06.2018"],"disp":"Comparing the accuracy of large language models and prompt engineering in diagnosing realworld casesInternational journal of medical informatics","type":{"media":"Online-Ressource","bibl":"periodical"},"recId":"265783720","language":["eng"],"pubHistory":["Volume 44, issue 1 (March 1997)-"],"part":{"volume":"203","text":"203(2025) vom: Nov., Artikel-ID 106026, Seite 1-8","extent":"8","year":"2025","pages":"1-8"}}],"physDesc":[{"noteIll":"Illustrationen","extent":"8 S."}],"name":{"displayForm":["Guanhong Yao, WuJi Zhang, Yingxi Zhu, Ut-kei Wong, Yanfeng Zhang, Cui Yang, Guanghao Shen, Zhanguo Li, Hui Gao"]},"id":{"eki":["1936787318"],"doi":["10.1016/j.ijmedinf.2025.106026"]},"origin":[{"dateIssuedKey":"2025","dateIssuedDisp":"November 2025"}],"recId":"1936787318","language":["eng"],"note":["Online verfügbar: 25. Juni 2025, Artikelversion: 4. 
Juli 2025","Gesehen am 25.09.2025"],"type":{"media":"Online-Ressource","bibl":"article-journal"},"person":[{"display":"Yao, Guanhong","roleDisplay":"VerfasserIn","role":"aut","family":"Yao","given":"Guanhong"},{"family":"Zhang","given":"WuJi","display":"Zhang, WuJi","roleDisplay":"VerfasserIn","role":"aut"},{"given":"Yingxi","family":"Zhu","role":"aut","roleDisplay":"VerfasserIn","display":"Zhu, Yingxi"},{"given":"Ut-kei","family":"Wong","role":"aut","display":"Wong, Ut-kei","roleDisplay":"VerfasserIn"},{"family":"Zhang","given":"Yanfeng","display":"Zhang, Yanfeng","roleDisplay":"VerfasserIn","role":"aut"},{"role":"aut","display":"Yang, Cui","roleDisplay":"VerfasserIn","given":"Cui","family":"Yang"},{"given":"Guanghao","family":"Shen","role":"aut","roleDisplay":"VerfasserIn","display":"Shen, Guanghao"},{"role":"aut","roleDisplay":"VerfasserIn","display":"Li, Zhanguo","given":"Zhanguo","family":"Li"},{"display":"Gao, Hui","roleDisplay":"VerfasserIn","role":"aut","family":"Gao","given":"Hui"}],"title":[{"title":"Comparing the accuracy of large language models and prompt engineering in diagnosing realworld cases","title_sort":"Comparing the accuracy of large language models and prompt engineering in diagnosing realworld cases"}]} 
SRT |a YAOGUANHONCOMPARINGT2025
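
Note on the hit1 metric: the abstract in field 520 defines diagnostic accuracy (hit1) as the inclusion of the first final diagnosis from the medical record in the model's top prediction. The following is a minimal Python sketch of that definition, not the authors' implementation. The record layout, the field names `first_final_diagnosis` and `top_prediction`, the sample records, and the use of case-insensitive substring matching to decide "inclusion" are all assumptions made for illustration; the record does not say how matches were adjudicated in the study.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial spelling differences still match."""
    return " ".join(text.lower().split())


def hit1(records: list[dict]) -> float:
    """Fraction of records whose top prediction includes the first final diagnosis.

    Assumption: "includes" is approximated by a normalized substring test.
    """
    if not records:
        raise ValueError("hit1 is undefined for an empty record set")
    hits = sum(
        normalize(r["first_final_diagnosis"]) in normalize(r["top_prediction"])
        for r in records
    )
    return hits / len(records)


# Hypothetical records, invented purely to exercise the function.
sample_records = [
    {"first_final_diagnosis": "Rheumatoid arthritis",
     "top_prediction": "rheumatoid arthritis with interstitial lung disease"},
    {"first_final_diagnosis": "Gout",
     "top_prediction": "Septic arthritis"},
    {"first_final_diagnosis": "Systemic lupus erythematosus",
     "top_prediction": "Systemic  lupus erythematosus"},
]

score = hit1(sample_records)
```

Substring matching would miss synonyms and abbreviations (for example "SLE" versus the full disease name), so a real evaluation would likely need clinician review or a terminology mapping on top of a check like this.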