Benchmarking foundation models as feature extractors for weakly supervised computational pathology

Bibliographic Details
Main Authors: Neidlinger, Peter (Author), El Nahhas, Omar S. M. (Author), Muti, Hannah Sophie (Author), Lenz, Tim (Author), Hoffmeister, Michael (Author), Brenner, Hermann (Author), van Treeck, Marko (Author), Langer, Rupert (Author), Dislich, Bastian (Author), Behrens, Hans Michael (Author), Röcken, Christoph (Author), Foersch, Sebastian (Author), Truhn, Daniel (Author), Marra, Antonio (Author), Saldanha, Oliver Lester (Author), Kather, Jakob Nikolas (Author)
Format: Article (Journal)
Language:English
Published: 01 October 2025
In: Nature biomedical engineering
Year: 2025, Pages: 1-11
ISSN:2157-846X
DOI:10.1038/s41551-025-01516-3
Online Access:Publisher, free of charge, full text: https://doi.org/10.1038/s41551-025-01516-3
Publisher, free of charge, full text: https://www.nature.com/articles/s41551-025-01516-3
Author Notes:Peter Neidlinger, Omar S. M. El Nahhas, Hannah Sophie Muti, Tim Lenz, Michael Hoffmeister, Hermann Brenner, Marko van Treeck, Rupert Langer, Bastian Dislich, Hans Michael Behrens, Christoph Röcken, Sebastian Foersch, Daniel Truhn, Antonio Marra, Oliver Lester Saldanha & Jakob Nikolas Kather
Description
Summary:Numerous pathology foundation models have been developed to extract clinically relevant information. There is currently limited literature independently evaluating these foundation models on external cohorts and clinically relevant tasks to uncover adjustments for future improvements. Here we benchmark 19 histopathology foundation models on 13 patient cohorts with 6,818 patients and 9,528 slides from lung, colorectal, gastric and breast cancers. The models were evaluated on weakly supervised tasks related to biomarkers, morphological properties and prognostic outcomes. We show that a vision-language foundation model, CONCH, yielded the highest overall performance when compared with vision-only foundation models, with Virchow2 as close second, although its superior performance was less pronounced in low-data scenarios and low-prevalence tasks. The experiments reveal that foundation models trained on distinct cohorts learn complementary features to predict the same label, and can be fused to outperform the current state of the art. An ensemble combining CONCH and Virchow2 predictions outperformed individual models in 55% of tasks, leveraging their complementary strengths in classification scenarios. Moreover, our findings suggest that data diversity outweighs data volume for foundation models.
Item Description:Viewed on 13.03.2026
Physical Description:Online Resource