Hallucination filtering in radiology vision-language models using discrete semantic entropy

Bibliographic Details
Main Authors: Wienholt, Patrick (Author); Caselitz, Sophie (Author); Siepmann, Robert (Author); Bruners, Philipp (Author); Bressem, Keno (Author); Kuhl, Christiane (Author); Kather, Jakob Nikolas (Author); Nebelung, Sven (Author); Truhn, Daniel (Author)
Format: Article (Journal)
Language: English
Published: 20 February 2026
In: European Radiology
Year: 2026, Pages: 1-12
ISSN: 1432-1084
DOI: 10.1007/s00330-026-12384-z
Online Access: Publisher, free of charge, full text: https://doi.org/10.1007/s00330-026-12384-z
Description
Summary:
Objective: To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA).
Materials and methods: This retrospective study evaluated DSE using two publicly available, de-identified datasets: the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 (Generative Pretrained Transformer) answered each question 15 times at a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p values and 95% confidence intervals were obtained using bootstrap resampling, with a Bonferroni-corrected threshold of p < 0.004 for statistical significance.
Results: Across 706 image–question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (334/706 questions retained) for GPT-4o and 63.8% (499/706 retained) for GPT-4.1 (both p < 0.001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction.
Conclusion: DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a practical filtering strategy for clinical VLM applications.
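To make the method concrete, the following is a minimal Python sketch of the DSE computation described in the summary, not the authors' code: the entails(a, b) predicate is a hypothetical placeholder for the bidirectional entailment check (e.g., an NLI model or an LLM judge), the greedy single-pass clustering is one common way to realize such grouping, and the natural logarithm is an assumption, since the abstract does not state the entropy base.

```python
import math
from collections import Counter

def cluster_by_entailment(answers, entails):
    """Group sampled answers into semantic clusters.

    Two answers share a cluster iff each entails the other, checked
    against the cluster's representative (bidirectional entailment).
    entails is a hypothetical callable: entails(premise, hypothesis) -> bool.
    """
    reps, labels = [], []
    for ans in answers:
        for i, rep in enumerate(reps):
            if entails(ans, rep) and entails(rep, ans):
                labels.append(i)
                break
        else:  # no existing cluster matched: open a new one
            reps.append(ans)
            labels.append(len(reps) - 1)
    return labels

def discrete_semantic_entropy(labels):
    """Shannon entropy of the relative cluster frequencies (natural log assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

# Worked example: 15 high-temperature samples that collapse into two
# semantic clusters of 12 and 3 meaning-equivalent answers.
labels = [0] * 12 + [1] * 3
print(f"DSE = {discrete_semantic_entropy(labels):.3f}")
# DSE = 0.500 -> retained under the DSE > 0.6 filter, rejected under DSE > 0.3
```

Under this scheme, a question is answered (with the low-temperature response) only if its DSE stays at or below the chosen threshold; otherwise it is rejected as likely to elicit a hallucination.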
Item Description: Published: 20 February 2026
Accessed on 10 April 2026
Physical Description: Online Resource