Punching above its weight: a head-to-head comparison of DeepSeek-R1 and OpenAI-o1 on pancreatic adenocarcinoma-related questions

Bibliographic Details
Main Authors: Li, Cheng-Peng; Chu, Yuan; Jia, Wei-Wei; Hakenberg, Priska; Şandra-Petrescu, Flavius Ionuţ; Reißfelder, Christoph; Yang, Cui
Format: Article (Journal)
Language: English
Published: 2025-08-22
In: International journal of medical sciences
Year: 2025, Volume: 22, Issue: 15, Pages: 3868-3877
ISSN:1449-1907
DOI:10.7150/ijms.118887
Online Access: Publisher, free access, full text: https://doi.org/10.7150/ijms.118887
Publisher, free access, full text: https://www.medsci.org/v22p3868.htm
Author Notes: Cheng-Peng Li, Yuan Chu, Wei-Wei Jia, Priska Hakenberg, Flavius Șandra-Petrescu, Christoph Reißfelder, Cui Yang
Description
Summary: Objective: This study aimed to compare the performance of DeepSeek-R1 and OpenAI-o1 in addressing complex pancreatic ductal adenocarcinoma (PDAC)-related clinical questions, focusing on accuracy, comprehensiveness, safety, and reasoning quality. Methods: Twenty PDAC-related questions derived from the current NCCN guidelines for PDAC were posed to both models. Responses were evaluated for accuracy, comprehensiveness, and safety, and chain-of-thought (CoT) outputs were rated for logical coherence and error handling by blinded clinical experts using 5-point Likert scales. Inter-rater reliability was assessed, and rating scores and response lengths (character counts) were compared between the two models. Results: Both models demonstrated high accuracy (median score: 5 vs. 5, p=0.527) and safety (5 vs. 5, p=0.285). DeepSeek-R1 outperformed OpenAI-o1 in comprehensiveness (median: 5 vs. 4.5, p=0.015) and generated significantly longer responses (median characters: 544 vs. 248, p<0.001). For reasoning quality, DeepSeek-R1 achieved superior scores in logical coherence (median: 5 vs. 4, p<0.001) and error handling (5 vs. 4, p<0.001), with 75% of its responses scoring full points compared to OpenAI-o1's 5%. Conclusion: While both models exhibit high clinical utility, DeepSeek-R1's enhanced reasoning capabilities, open-source nature, and cost-effectiveness position it as a promising tool for complex oncology decision support. Further validation in real-world multimodal clinical scenarios is warranted.
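Note: The abstract reports medians and p-values for paired per-question Likert ratings but does not name the statistical test used. For paired 5-point ordinal data of this kind, a Wilcoxon signed-rank test is a common choice; the Python sketch below illustrates such a comparison under that assumption, using hypothetical scores (illustrative values only, not the study's data).

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical 5-point Likert ratings for 20 paired PDAC questions
    # (illustrative values only; the study's actual ratings are not reproduced here).
    deepseek_r1 = np.array([5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5])
    openai_o1   = np.array([4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 4])

    # Paired, non-parametric comparison; zero differences are dropped
    # under the default zero_method.
    stat, p = wilcoxon(deepseek_r1, openai_o1)
    print(f"medians: {np.median(deepseek_r1)} vs. {np.median(openai_o1)}, p = {p:.4f}")

A paired test is the natural fit here because both models answered the same twenty questions, so each question yields one matched pair of ratings.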
Item Description: Viewed on 2025-11-03
Physical Description:Online Resource