Automatic code generation for high-performance Discontinuous Galerkin methods on modern architectures


Detailed description

Bibliographic details
Main authors: Kempf, Dominic (author), Heß, René (author), Müthing, Steffen (author), Bastian, Peter (author)
Document type: Article (Journal)
Language: English
Published: December 2020
In: ACM Transactions on Mathematical Software
Year: 2020, Volume: 47, Issue: 1, Pages: 1-31
ISSN: 1557-7295
DOI: 10.1145/3424144
Online access: Publisher, license required, full text: https://doi.org/10.1145/3424144
Statement of responsibility: Dominic Kempf, René Heß, Steffen Müthing, and Peter Bastian
Description
Summary: SIMD vectorization has lately become a key challenge in high-performance computing. However, hand-written explicitly vectorized code often poses a threat to the software’s sustainability. In this publication, we solve this sustainability and performance portability issue by enriching the simulation framework dune-pdelab with a code generation approach. The approach is based on the well-known domain-specific language UFL but combines it with loopy, a more powerful intermediate representation for the computational kernel. Given this flexible tool, we present and implement a new class of vectorization strategies for the assembly of Discontinuous Galerkin methods on hexahedral meshes exploiting the finite element’s tensor product structure. The performance-optimal variant from this class is chosen by the code generator through an auto-tuning approach. The implementation is done within the open source PDE software framework Dune and the discretization module dune-pdelab. The strength of the proposed approach is illustrated with performance measurements for DG schemes for a scalar diffusion reaction equation and the Stokes equation. In our measurements, we utilize both the AVX2 and the AVX512 instruction sets, achieving 30% to 40% of the machine’s theoretical peak performance for one matrix-free application of the operator.
Description: Viewed on 13.02.2022
Description: Online resource