Lifting the curse from high-dimensional data: automated projection pursuit clustering for a variety of biological data modalities

Bibliographic Details
Main Authors: Simpson, Claire (Author); Tabatsky, Evgeniy (Author); Rahil, Zainab (Author); Eddins, Devon J (Author); Tkachev, Sasha (Author); Georgescauld, Florian (Author); Papalegis, Derek (Author); Culka, Martin (Author); Levy, Tyler (Author); Gregoretti, Ivan (Author); Meehan, Connor (Author); Schiller, Chiara (Author); Bestak, Kresimir (Author); Schapiro, Denis (Author); Chernyshev, Andrei (Author); Walther, Guenther (Author); Ghosn, Eliver E B (Author); Orlova, Darya (Author)
Format: Article (Journal)
Language: English
Published: 2025
In: GigaScience
Year: 2025, Volume: 14, Pages: 1-20
ISSN: 2047-217X
DOI: 10.1093/gigascience/giaf052
Online Access: Publisher, free of charge, full text: https://doi.org/10.1093/gigascience/giaf052
Author Notes: Claire Simpson, Evgeniy Tabatsky, Zainab Rahil, Devon J. Eddins, Sasha Tkachev, Florian Georgescauld, Derek Papalegis, Martin Culka, Tyler Levy, Ivan Gregoretti, Connor Meehan, Chiara Schiller, Kresimir Bestak, Denis Schapiro, Andrei Chernyshev, Guenther Walther, Eliver E.B. Ghosn, and Darya Orlova
Description
Summary: Unsupervised clustering is a powerful machine-learning technique widely used to analyze high-dimensional biological data. It plays a crucial role in uncovering patterns, structures, and inherent relationships within complex datasets without relying on predefined labels. In the context of biology, high-dimensional data may include transcriptomics, proteomics, and a variety of single-cell omics data. Most existing clustering algorithms operate directly in the high-dimensional space, and their performance may be negatively affected by the phenomenon known as the curse of dimensionality. Here, we show an alternative clustering approach that alleviates the curse by sequentially projecting high-dimensional data into a low-dimensional representation. We validated the effectiveness of our approach, named automated projection pursuit (APP), across various biological data modalities, including flow and mass cytometry data, scRNA-seq, multiplex imaging data, and T-cell receptor repertoire data. APP efficiently recapitulated experimentally validated cell-type definitions and revealed new biologically meaningful patterns.
Item Description: Published: 29 May 2025
Viewed on 23 September 2025
Physical Description: Online Resource