Punching above its weight: a head-to-head comparison of DeepSeek-R1 and OpenAI-o1 on pancreatic adenocarcinoma-related questions
| Main Authors: | Cheng-Peng Li, Yuan Chu, Wei-Wei Jia, Priska Hakenberg, Flavius Șandra-Petrescu, Christoph Reißfelder, Cui Yang |
|---|---|
| Format: | Article (Journal) |
| Language: | English |
| Published: | 2025-8-22 |
| In: | International journal of medical sciences, Year: 2025, Volume: 22, Issue: 15, Pages: 3868-3877 |
| ISSN: | 1449-1907 |
| DOI: | 10.7150/ijms.118887 |
| Online Access: | Publisher, free of charge, full text: https://doi.org/10.7150/ijms.118887 Publisher, free of charge, full text: https://www.medsci.org/v22p3868.htm |
| Author Notes: | Cheng-Peng Li, Yuan Chu, Wei-Wei Jia, Priska Hakenberg, Flavius Șandra-Petrescu, Christoph Reißfelder, Cui Yang |
| Summary: | Objective: This study aimed to compare the performance of DeepSeek-R1 and OpenAI-o1 in addressing complex pancreatic ductal adenocarcinoma (PDAC)-related clinical questions, focusing on accuracy, comprehensiveness, safety, and reasoning quality. Methods: Twenty PDAC-related questions derived from the up-to-date NCCN guidelines for PDAC were posed to both models. Responses were evaluated for accuracy, comprehensiveness, and safety, and chain-of-thought (CoT) outputs were rated for logical coherence and error handling by blinded clinical experts using 5-point Likert scales. Inter-rater reliability, evaluated scores, and character counts by both models were compared. Results: Both models demonstrated high accuracy (median score: 5 vs. 5, p=0.527) and safety (5 vs. 5, p=0.285). DeepSeek-R1 outperformed OpenAI-o1 in comprehensiveness (median: 5 vs. 4.5, p=0.015) and generated significantly longer responses (median characters: 544 vs. 248, p<0.001). For reasoning quality, DeepSeek-R1 achieved superior scores in logical coherence (median: 5 vs. 4, p<0.001) and error handling (5 vs. 4, p<0.001), with 75% of its responses scoring full points compared to OpenAI-o1's 5%. Conclusion: While both models exhibit high clinical utility, DeepSeek-R1's enhanced reasoning capabilities, open-source nature, and cost-effectiveness position it as a promising tool for complex oncology decision support. Further validation in real-world multimodal clinical scenarios is warranted. |
| Item Description: | Viewed on 03.11.2025 |
| Physical Description: | Online Resource |