Stencil codes are a common computational pattern used to solve a wide range of problems, such as partial differential equations. On parallel heterogeneous systems with CPUs and GPUs, the domain is usually split and assigned to GPUs, where it is further divided into GPU blocks. The iterative distributed stencil computation consists of two steps – computation and communication, where the subdomains exchange boundary data, also called 'halo exchange'. On multi-node systems, it is crucial to efficiently transfer data from one GPU to another via MPI, the de facto standard communication solution in HPC.
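For orientation, a minimal sketch of one such halo exchange on a single subdomain boundary might look as follows (a 1D decomposition, one GPU per MPI rank, a CUDA-unaware MPI, and hypothetical buffer names are assumed; the data is staged through ordinary host buffers):

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Stage the boundary through host buffers: pack (device-to-host copy),
    // MPI exchange with the neighbouring rank, unpack (host-to-device copy).
    void exchange_halo(const double *d_boundary, double *d_ghost,
                       double *h_send, double *h_recv, int halo_elems,
                       int neighbour, MPI_Comm comm)
    {
        cudaMemcpy(h_send, d_boundary, halo_elems * sizeof(double),
                   cudaMemcpyDeviceToHost);

        MPI_Sendrecv(h_send, halo_elems, MPI_DOUBLE, neighbour, 0,
                     h_recv, halo_elems, MPI_DOUBLE, neighbour, 0,
                     comm, MPI_STATUS_IGNORE);

        cudaMemcpy(d_ghost, h_recv, halo_elems * sizeof(double),
                   cudaMemcpyHostToDevice);
    }

In this naive form, packing, communication, and unpacking are fully serialized, which is exactly the cost the approaches studied in this thesis try to hide.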
In this master thesis, methods of GPU-to-GPU data exchange via MPI are examined with a focus on halo exchange. The thesis describes the design of a set of naive baseline approaches and a set of optimized solutions called taskqueue. The main idea behind the taskqueue approach is to overlap packing and unpacking (computation) with host-to-host MPI communication, and to reuse one kernel for both packing and unpacking workloads, eliminating the kernel launch, termination, and synchronization overheads. The implementation relies on pinned host memory, a segment of main memory accessible by both the CPU and GPU, through which the two parties communicate. A portable solution that runs on both NVIDIA and AMD GPUs is designed, so that the differences between the two platforms can be observed.
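A rough sketch of this pinned-memory signalling mechanism is given below (CUDA is shown for concreteness; the single-flag protocol and all names are illustrative assumptions, and the actual taskqueue implementation, which also targets AMD GPUs, may differ in detail):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void persistent_worker(volatile int *flag)
    {
        // One persistent kernel: poll a flag the CPU writes into pinned host
        // memory and exit when the CPU signals shutdown (flag == -1).
        while (true) {
            int task = *flag;
            if (task == -1) break;     // shutdown request from the host
            if (task > 0) {
                // ... perform a packing or unpacking task here ...
                *flag = 0;             // acknowledge completion to the host
            }
        }
    }

    int main()
    {
        volatile int *h_flag;
        int *d_flag;
        // Pinned, mapped host memory visible to both CPU and GPU.
        cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_flag, (void *)h_flag, 0);

        *h_flag = 0;
        persistent_worker<<<1, 1>>>((volatile int *)d_flag);

        *h_flag = 1;                   // submit one "task" to the GPU
        while (*h_flag != 0) { }       // wait for the GPU to acknowledge
        *h_flag = -1;                  // ask the kernel to terminate
        cudaDeviceSynchronize();

        cudaFreeHost((void *)h_flag);
        printf("done\n");
        return 0;
    }

Because the kernel stays resident and only waits for new work, packing and unpacking requests can be issued while MPI transfers are in flight on the host, which is the overlap the taskqueue approach exploits.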
The performance of the taskqueue approaches is evaluated against a baseline reference on both the NVIDIA and AMD testbeds. The tests on the NVIDIA testbed yield a stable speedup ranging from 1.09 to 1.21 across different workload sizes. In contrast, the approach did not prove useful on the AMD testbed, where it needed more than 200× as much time to finish. The main reason is the problematic behaviour of concurrent reads from and writes to the same memory location by the CPU and GPU.
This and other observations, made mainly on the AMD testbed, are identified and their implications are discussed in this work. The discussion reveals some of the difficulties of platform-agnostic GPU development and uncovers unexpected behaviour patterns when AMD GPUs are combined with MPI. Finally, optimizations to the taskqueue algorithm are proposed in the hope of achieving better performance on the AMD testbed as well.