Improving accuracy and source transparency in responses to soft tissue sarcoma queries using GPT-4o enhanced with German evidence-based guidelines
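| Main Authors: | Cheng-Peng Li, Wei-Wei Jia, Yuan Chu, Franka Menge, Tobias Speer, Christoph Reißfelder, Peter Hohenberger, Jens Jakob, Cui Yang |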
|---|---|
| Format: | Article (Journal) |
| Language: | English |
| Published: | June 2025 |
| In: | Oncology research and treatment, Year: 2025, Volume: 48, Issue: 6, Pages: 351-359 |
| ISSN: | 2296-5262 |
| DOI: | 10.1159/000544978 |
| Online Access: | Publisher, license required, full text: https://doi.org/10.1159/000544978 |
| Author Notes: | Cheng-Peng Li, Wei-Wei Jia, Yuan Chu, Franka Menge, Tobias Speer, Christoph Reißfelder, Peter Hohenberger, Jens Jakob, Cui Yang |
| Summary: | Introduction: This study aimed to evaluate the effectiveness of GPT-4o, with and without retrieval-augmented generation (RAG), in responding to soft tissue sarcoma (STS)-related queries. Methods: The study used a 20-question dataset derived from clinical scenarios related to adult STS. The responses were generated by GPT-4o with and without the RAG approach. The RAG system incorporated the English version of German evidence-based S3 guidelines through an embedding-based retrieval system. Two sarcoma experts evaluated the responses for accuracy, comprehensiveness, and safety using a Likert scale. Statistical analyses were conducted to compare the performances. Results: GPT-4o with RAG outperformed the model without RAG across all evaluated areas (p < 0.05). GPT-4o without RAG had a 40% error rate, which was reduced to 10% by the RAG approach. In 90% of the questions, the pages with the relevant information that addressed the questions were correctly cited using the retrieval system. Conclusion: The RAG approach significantly enhanced the performance of GPT-4o in answering STS-related questions. However, the model still produced incorrect responses in certain complex scenarios. GPT-4o, even with RAG, should be used cautiously in clinical settings, particularly for rare diseases like sarcoma. Human expertise remains irreplaceable in medical decision-making. We evaluated how well the artificial intelligence (AI) model GPT-4o performed when responding to questions on soft tissue sarcoma (STS), a rare form of cancer. We developed 20 questions based on actual medical scenarios involving STS and tested the model’s capacity to deliver thorough and accurate answers both with and without using a retrieval-augmented generation (RAG) system, which uses German guidelines for STS to help the model find relevant information. The correctness, thoroughness, and safety of the model’s replies were assessed by two sarcoma specialists. The outcomes demonstrated that GPT-4o’s performance was enhanced by the RAG system. The AI committed mistakes on 40% of the questions without RAG, but with RAG, the error rate decreased to 10%. In 90% of cases, the RAG system correctly identified the information needed to answer the questions. Although the RAG system improved the model’s accuracy, it still struggled with some complex cases. The study suggests that while GPT-4o with RAG can assist in medical decision-making, it cannot replace human expertise, especially for rare diseases like sarcoma. |
| Item Description: | Published online: 28 February 2025. Accessed: 12 August 2025 |
| Physical Description: | Online Resource |