Butterfly factorization for vision transformers on multi-IPU systems
| Main authors: | S.-Kazem Shekofteh, Daniel Bogacz, Christian Alles, Holger Fröning |
|---|---|
| Document type: | Article (Journal) |
| Language: | English |
| Published: | March 2026 |
| In: | Parallel Computing, Year: 2026, Volume: 127, Pages: 1-10 |
| ISSN: | 1872-7336 |
| DOI: | 10.1016/j.parco.2025.103165 |
| Online access: | Publisher, free of charge, full text: https://doi.org/10.1016/j.parco.2025.103165 ; Publisher, free of charge, full text: https://www.sciencedirect.com/science/article/pii/S0167819125000419 |
| Author statement: | S.-Kazem Shekofteh, Daniel Bogacz, Christian Alles, Holger Fröning |
| Abstract: | Recent advances in machine learning have led to increasingly large and complex models, placing significant demands on computation and memory. Techniques such as Butterfly factorization have emerged to reduce model parameters and memory footprints while preserving accuracy. Specialized hardware accelerators, such as Graphcore’s Intelligence Processing Units (IPUs), are designed to address these challenges through massive parallelism and efficient on-chip memory utilization. In this paper, we extend our analysis of Butterfly structures for efficient utilization on single and multiple IPUs, comparing their performance with GPUs. These structures drastically reduce the number of parameters and memory footprint while preserving model accuracy. Experimental results on the Graphcore GC200 IPU chip, compared with an NVIDIA A30 GPU, demonstrate a 98.5% compression ratio, with speedups of 1.6× and 1.3× for Butterfly and Pixelated Butterfly structures, respectively. Extending our evaluation to Vision Transformer (ViT) models, we compare Multi-GPU and Multi-IPU systems on the M2000 machine: Multi-GPU reaches a maximum accuracy of 84.51% with a training time of 401.44 min, whereas Multi-IPU attains a higher maximum accuracy of 88.92% with a training time of 694.03 min. These results demonstrate that Butterfly factorization enables substantial compression of ViT layers (up to 97.17%) while improving model accuracy. The findings highlight the promise of IPU machines as a suitable platform for large-scale machine learning model training, especially when coupled with sparsification methods like Butterfly factorization, thanks to their efficient support for model parallelism. |
| Description: | Available online: 27 November 2025; article version: 10 December 2025. Viewed on 5 February 2026 |
| Description: | Online resource |
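
The abstract attributes the compression to replacing dense weight matrices with products of sparse butterfly factors. The sketch below is not from the paper or this record; it is a minimal NumPy illustration, assuming the standard 2×2-block butterfly structure, of how an n × n matrix becomes log2(n) sparse factors with 2n nonzeros each, so the parameter count falls from n² to 2n·log2(n). The function name and the layer width n = 256 are assumptions chosen purely for illustration.

```python
# Illustrative sketch: butterfly factorization of an n x n matrix into
# log2(n) sparse factors, each with exactly two nonzeros per row.
import numpy as np


def butterfly_factors(n, rng):
    """Return log2(n) random butterfly factors whose product is n x n.

    Factor k mixes index pairs at stride n / 2**(k+1) with 2x2 blocks,
    so every factor holds 2n nonzeros instead of n**2.
    """
    assert n > 1 and n & (n - 1) == 0, "n must be a power of two"
    factors = []
    for k in range(int(np.log2(n))):
        stride = n // 2 ** (k + 1)
        f = np.zeros((n, n))
        for start in range(0, n, 2 * stride):
            for j in range(stride):
                i0, i1 = start + j, start + stride + j
                a, b, c, d = rng.standard_normal(4)
                f[i0, i0], f[i0, i1] = a, b
                f[i1, i0], f[i1, i1] = c, d
        factors.append(f)
    return factors


rng = np.random.default_rng(0)
n = 256                                   # assumed layer width, for illustration only
factors = butterfly_factors(n, rng)

dense_params = n * n                      # parameters of an unfactorized linear layer
butterfly_params = 2 * n * len(factors)   # 2n nonzeros per factor, log2(n) factors
print(f"dense: {dense_params}, butterfly: {butterfly_params}, "
      f"compression: {1 - butterfly_params / dense_params:.2%}")

# Multiplying the factors recovers the dense matrix the factorization represents;
# in practice the sparse factors are applied directly and never materialized.
w = np.linalg.multi_dot(factors)
y = w @ rng.standard_normal(n)
```

For n = 256 this already yields roughly 94% compression; the 98.5% and 97.17% figures quoted in the abstract come from the paper's specific layer shapes and the Pixelated Butterfly variant, which this sketch does not model.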