This work presents a variant of the conjugate gradient (CG) method with minimal memory access for the
vector operations, targeting high-order finite-element schemes with fast matrix-free operator evaluation and cheap
preconditioners like the matrix diagonal. The algorithm relies on a data-dependency analysis and interleaves the vector
updates and inner products in a CG iteration with the matrix-vector product. As a result, around 90% of the vector
entries of the three active vectors of the CG method are transferred from slow main memory exactly once per iteration,
with all additional accesses hitting fast cache memory. Node-level performance analyses and scaling studies on up to
147k cores show that the new method is around two times faster than a standard CG solver as well as optimized
pipelined CG and s-step CG methods for large sizes that exceed processor caches, and provides similar performance
near the strong scaling limit.
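To make the interleaving idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a diagonally preconditioned CG iteration in which each sweep over the vectors fuses the operations that can share a pass: the block-wise matrix-vector product with the (p, Ap) reduction, and the x/r updates with the preconditioner application and the (r, z) reduction. The block size and dense matvec are purely illustrative; the paper's scheme operates within a matrix-free finite-element loop instead.

```python
import numpy as np

def fused_cg(A, b, M_inv, tol=1e-10, max_iter=100):
    """Diagonally preconditioned CG with fused vector passes.

    Illustrative sketch: each pass streams the active vectors once
    and combines the matvec, AXPY updates, and reductions it needs,
    mimicking the data-locality idea of the interleaved CG variant.
    """
    n = len(b)
    x = np.zeros(n)
    r = b.copy()            # residual, assuming x0 = 0
    z = M_inv * r           # preconditioned residual
    p = z.copy()            # search direction
    rz = r @ z
    block = 4               # illustrative block size (cache-sized in practice)
    for _ in range(max_iter):
        # Pass 1: block-wise matvec fused with the (p, Ap) reduction,
        # so p and v = A p are each streamed once.
        v = np.empty(n)
        pv = 0.0
        for s in range(0, n, block):
            e = min(s + block, n)
            v[s:e] = A[s:e] @ p
            pv += p[s:e] @ v[s:e]
        alpha = rz / pv
        # Pass 2: fuse the x and r updates, the diagonal-preconditioner
        # application, and the (r, z) reduction into one sweep.
        rz_new = 0.0
        for s in range(0, n, block):
            e = min(s + block, n)
            x[s:e] += alpha * p[s:e]
            r[s:e] -= alpha * v[s:e]
            z[s:e] = M_inv[s:e] * r[s:e]
            rz_new += r[s:e] @ z[s:e]
        if np.sqrt(r @ r) < tol:
            break
        beta = rz_new / rz
        p = z + beta * p    # a final pass that could be fused with the next matvec
        rz = rz_new
    return x
```

In a cache-aware implementation, the update of p would additionally be interleaved with the first touch of the next iteration's matrix-vector product, which is where the reported near-single-transfer behavior of the three active vectors comes from.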