Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes

Jean-Matthieu Gallard; Leonhard Rannabauer; Anne Reinarz; Michael Bader

doi:10.1109/IPDPSW50202.2020.00126

Title:: Vectorization and Minimization of Memory Footprint for Linear High-Order Discontinuous Galerkin Schemes
Document type:: Conference paper
Type of conference contribution:: Text contribution / paper
Author(s):: Jean-Matthieu Gallard; Leonhard Rannabauer; Anne Reinarz; Michael Bader
Abstract:: We present a sequence of optimizations to the performance-critical compute kernels of the high-order discontinuous Galerkin solver of the hyperbolic PDE engine ExaHyPE, successively tackling bottlenecks due to SIMD operations, cache hierarchies and restrictions in the software design. Starting from a generic scalar implementation of the numerical scheme, our first optimized variant applies state-of-the-art optimization techniques by vectorizing loops, improving the data layout and using Loop-over-GEMM to perform tensor contractions via highly optimized matrix multiplication functions provided by the LIBXSMM library. We show that memory stalls due to a memory footprint exceeding our L2 cache size hindered the vectorization gains. We therefore introduce a new kernel that applies a sum factorization approach to reduce the kernel's memory footprint and improve its cache locality. With the L2 cache bottleneck removed, we were able to exploit additional vectorization opportunities by introducing a hybrid Array-of-Structure-of-Array data layout that resolves the data layout conflict between the matrix multiplication kernels and the point-wise functions that implement PDE-specific terms. With this last kernel, evaluated in a benchmark simulation at high polynomial order, only 2% of the floating point operations are still performed using scalar instructions and 22.5% of the available performance is achieved.
Horizon 2020:: Supported by Horizon 2020 project no. 671698 (ExaHyPE) and project no. 823844 (ChEESE)
Conference / book title:: The 21st IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-2020)
Year:: 2020
Quarter:: 1st quarter
Year / month:: 2020-02
Month:: Feb
Reviewed:: yes
Language:: en
Full text / DOI:: doi:10.1109/IPDPSW50202.2020.00126
WWW:: https://doi.ieeecomputersociety.org/10.1109/IPDPSW50202.2020.00126
TUM institution:: Department of Informatics