Punching above its weight: a head-to-head comparison of DeepSeek-R1 and OpenAI-o1 on pancreatic adenocarcinoma-related questions

Bibliographic Details
Main authors: Li, Cheng-Peng (author); Chu, Yuan (author); Jia, Wei-Wei (author); Hakenberg, Priska (author); Șandra-Petrescu, Flavius Ionuț (author); Reißfelder, Christoph (author); Yang, Cui (author)
Document type: Article (Journal)
Language: English
Published: 2025-08-22
In: International Journal of Medical Sciences
Year: 2025, Volume: 22, Issue: 15, Pages: 3868-3877
ISSN: 1449-1907
DOI: 10.7150/ijms.118887
Online access: Publisher, free of charge, full text: https://doi.org/10.7150/ijms.118887
Publisher, free of charge, full text: https://www.medsci.org/v22p3868.htm
Authors: Cheng-Peng Li, Yuan Chu, Wei-Wei Jia, Priska Hakenberg, Flavius Șandra-Petrescu, Christoph Reißfelder, Cui Yang
Description
Abstract: Objective: This study aimed to compare the performance of DeepSeek-R1 and OpenAI-o1 in addressing complex pancreatic ductal adenocarcinoma (PDAC)-related clinical questions, focusing on accuracy, comprehensiveness, safety, and reasoning quality. Methods: Twenty PDAC-related questions derived from the current NCCN guidelines for PDAC were posed to both models. Responses were evaluated for accuracy, comprehensiveness, and safety, and chain-of-thought (CoT) outputs were rated for logical coherence and error handling by blinded clinical experts using 5-point Likert scales. Inter-rater reliability was assessed, and rating scores and response lengths (character counts) were compared between the two models. Results: Both models demonstrated high accuracy (median score: 5 vs. 5, p=0.527) and safety (5 vs. 5, p=0.285). DeepSeek-R1 outperformed OpenAI-o1 in comprehensiveness (median: 5 vs. 4.5, p=0.015) and generated significantly longer responses (median characters: 544 vs. 248, p<0.001). For reasoning quality, DeepSeek-R1 achieved superior scores in logical coherence (median: 5 vs. 4, p<0.001) and error handling (5 vs. 4, p<0.001), with 75% of its responses scoring full points compared to OpenAI-o1's 5%. Conclusion: While both models exhibit high clinical utility, DeepSeek-R1's enhanced reasoning capabilities, open-source nature, and cost-effectiveness position it as a promising tool for complex oncology decision support. Further validation in real-world multimodal clinical scenarios is warranted.
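As a minimal sketch of the kind of paired comparison the abstract reports, the snippet below contrasts 5-point Likert ratings for the same 20 questions across the two models. The ratings are fabricated placeholders, not the study's data, and the abstract does not name the statistical test used; a Wilcoxon signed-rank test is assumed here as a standard choice for paired ordinal scores.

```python
# Illustrative sketch only: fabricated Likert ratings, assumed Wilcoxon test.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical 5-point Likert ratings for the same 20 questions,
# paired by question across the two models.
deepseek_r1 = np.array([5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5])
openai_o1 = np.array([4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4])

# Two-sided Wilcoxon signed-rank test on the paired differences;
# zero differences (ties between models) are discarded by the default
# zero_method="wilcox".
stat, p = wilcoxon(deepseek_r1, openai_o1)
print(f"median DeepSeek-R1: {np.median(deepseek_r1):.1f}, "
      f"median OpenAI-o1: {np.median(openai_o1):.1f}, p = {p:.4f}")
```

Inter-rater reliability between the blinded experts could be quantified analogously, e.g. with an intraclass correlation coefficient or Cohen's kappa, though the abstract does not specify which measure was used.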
Description: Online resource; viewed on 2025-11-03