Punching above its weight: a head-to-head comparison of DeepSeek-R1 and OpenAI-o1 on pancreatic adenocarcinoma-related questions

Bibliographic Details
Main Authors: Li, Cheng-Peng; Chu, Yuan; Jia, Wei-Wei; Hakenberg, Priska; Şandra-Petrescu, Flavius Ionuţ; Reißfelder, Christoph; Yang, Cui
Format: Article (Journal)
Language: English
Published: 2025-08-22
In: International journal of medical sciences
Year: 2025, Volume: 22, Issue: 15, Pages: 3868-3877
ISSN:1449-1907
DOI:10.7150/ijms.118887
Online Access: Publisher, free access, full text: https://doi.org/10.7150/ijms.118887
Publisher, free access, full text: https://www.medsci.org/v22p3868.htm
Author Notes: Cheng-Peng Li, Yuan Chu, Wei-Wei Jia, Priska Hakenberg, Flavius Șandra-Petrescu, Christoph Reißfelder, Cui Yang
Description
Summary: Objective: This study aimed to compare the performance of DeepSeek-R1 and OpenAI-o1 in addressing complex pancreatic ductal adenocarcinoma (PDAC)-related clinical questions, focusing on accuracy, comprehensiveness, safety, and reasoning quality. Methods: Twenty PDAC-related questions derived from the current NCCN guidelines for PDAC were posed to both models. Responses were evaluated for accuracy, comprehensiveness, and safety, and chain-of-thought (CoT) outputs were rated for logical coherence and error handling by blinded clinical experts using 5-point Likert scales. Inter-rater reliability was assessed, and rating scores and response lengths (character counts) were compared between the two models. Results: Both models demonstrated high accuracy (median score: 5 vs. 5, p=0.527) and safety (5 vs. 5, p=0.285). DeepSeek-R1 outperformed OpenAI-o1 in comprehensiveness (median: 5 vs. 4.5, p=0.015) and generated significantly longer responses (median characters: 544 vs. 248, p<0.001). For reasoning quality, DeepSeek-R1 achieved superior scores in logical coherence (median: 5 vs. 4, p<0.001) and error handling (5 vs. 4, p<0.001), with 75% of its responses scoring full points compared to OpenAI-o1's 5%. Conclusion: While both models exhibit high clinical utility, DeepSeek-R1's enhanced reasoning capabilities, open-source nature, and cost-effectiveness position it as a promising tool for complex oncology decision support. Further validation in real-world multimodal clinical scenarios is warranted.
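Note: The abstract reports medians and p-values for paired per-question Likert ratings but does not name the statistical test used. For paired 5-point ordinal data of this kind, a Wilcoxon signed-rank test is a common choice; the Python sketch below illustrates such a comparison under that assumption, using hypothetical scores (illustrative values only, not the study's data).

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical 5-point Likert ratings for 20 paired PDAC questions
    # (illustrative values only; the study's actual ratings are not reproduced here).
    deepseek_r1 = np.array([5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5])
    openai_o1   = np.array([4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 4])

    # Paired, non-parametric comparison; zero differences are dropped
    # under the default zero_method.
    stat, p = wilcoxon(deepseek_r1, openai_o1)
    print(f"medians: {np.median(deepseek_r1)} vs. {np.median(openai_o1)}, p = {p:.4f}")

A paired test is the natural fit here because both models answered the same twenty questions, so each question yields one matched pair of ratings.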
Item Description: Viewed on 2025-11-03
Physical Description:Online Resource