Butterfly factorization for vision transformers on multi-IPU systems


Bibliographic details
Main authors: S. Kazem Shekofteh (author), Daniel Bogacz (author), Christian Alles (author), Holger Fröning (author)
Document type: Article (Journal)
Language: English
Published: March 2026
In: Parallel computing
Year: 2026, Volume: 127, Pages: 1-10
ISSN:1872-7336
DOI:10.1016/j.parco.2025.103165
Online access: Publisher, free of charge, full text: https://doi.org/10.1016/j.parco.2025.103165
Publisher, free of charge, full text: https://www.sciencedirect.com/science/article/pii/S0167819125000419
Authors: S.-Kazem Shekofteh, Daniel Bogacz, Christian Alles, Holger Fröning
Description
Abstract: Recent advances in machine learning have led to increasingly large and complex models, placing significant demands on computation and memory. Techniques such as Butterfly factorization have emerged to reduce model parameters and memory footprints while preserving accuracy. Specialized hardware accelerators, such as Graphcore’s Intelligence Processing Units (IPUs), are designed to address these challenges through massive parallelism and efficient on-chip memory utilization. In this paper, we extend our analysis of Butterfly structures for efficient utilization on single and multiple IPUs, comparing their performance with GPUs. These structures drastically reduce the number of parameters and memory footprint while preserving model accuracy. Experimental results on the Graphcore GC200 IPU chip, compared with an NVIDIA A30 GPU, demonstrate a 98.5% compression ratio, with speedups of 1.6× and 1.3× for Butterfly and Pixelated Butterfly structures, respectively. Extending our evaluation to Vision Transformer (ViT) models, we compare Multi-GPU and Multi-IPU systems on the M2000 machine: Multi-GPU reaches a maximum accuracy of 84.51% with a training time of 401.44 min, whereas Multi-IPU attains a higher maximum accuracy of 88.92% with a training time of 694.03 min. These results demonstrate that Butterfly factorization enables substantial compression of ViT layers (up to 97.17%) while improving model accuracy. The findings highlight the promise of IPU machines as a suitable platform for large-scale machine learning model training, especially when coupled with sparsification methods like Butterfly factorization, thanks to their efficient support for model parallelism.
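The compression ratios quoted in the abstract follow from simple parameter counting: a standard butterfly factorization replaces a dense n×n weight matrix with log2(n) sparse factors, each holding roughly 2n nonzero parameters. A minimal sketch of this arithmetic (the counting rule is the textbook butterfly structure, not code from the paper; the specific n is an illustrative assumption):

```python
import math

def dense_params(n: int) -> int:
    """Parameters in a dense n x n weight matrix."""
    return n * n

def butterfly_params(n: int) -> int:
    """Parameters in a butterfly factorization of an n x n matrix:
    log2(n) factors, each a sparse matrix of 2x2 blocks with 2n nonzeros.
    Requires n to be a power of two."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    return 2 * n * int(math.log2(n))

# Illustrative layer width; the paper does not specify this value here.
n = 1024
d, b = dense_params(n), butterfly_params(n)
print(f"dense: {d}, butterfly: {b}, compression: {1 - b / d:.2%}")
# For n = 1024 this yields roughly 98% compression, the same order
# as the 98.5% ratio reported in the abstract.
```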
Description: Available online: 27 November 2025; article version: 10 December 2025
Accessed: 5 February 2026
Description: Online resource