LAIK is a new library which makes writing Single-Program-Multiple-Data (SPMD)
programs easier by abstracting the necessary inter-process communication. To
achieve this, it requires the programmer to explicitly define the data containers
shared between processes and to provide a mapping between simple integer indices
and the specific data elements. Since LAIK then both knows about and manages
all the shared data, it can automatically distribute the data among the available
workers and facilitate the necessary communication to synchronize the individual
processes.
For actually sending and receiving data between the collaborating processes, LAIK
relies on its backend, which also provides the necessary management information
used to determine the total number of processes and the local ID among those.
Originally, LAIK only had a single backend which used the MPI API to talk to various
MPI implementations. Unfortunately, this meant that LAIK’s backend interface
was not well tested and that a working MPI implementation was a prerequisite to
run LAIK applications.
In this work, we extend LAIK with a new backend using native TCP sockets provided
by the operating system. We first introduce the design by presenting the main
challenges to overcome and the solutions we found for them. Then we introduce
the resulting implementation and evaluate it in comparison to the existing MPI
backend. For this, we use a small cluster of single-board computers (SBCs) with a
custom OpenMPI installation and a fast network interconnect.
We demonstrate that our new TCP backend can provide comparable performance
to the existing MPI backend in most cases, with a few test cases even showing
significantly lower total execution times with the new backend in use. Most importantly,
we find that the new TCP backend works especially well if given many send/receive
operations per invocation, with a medium or large number of bytes to transmit
per operation. Conversely, the performance worsens dramatically if the backend is
given only a few operations (or just one) per invocation and/or very small messages
to transmit.
Finally, this work also showcases a way of bringing fault tolerance transparently to
SPMD applications using LAIK by generating unique identifiers for each exchanged
message. By simply not removing outgoing messages from the output buffer after
successfully delivering them, we allow failed instances to be restarted (possibly
even on a different host) and to regain their lost state by requesting the necessary
messages once more. We also show that this approach to fault tolerance is possible
while only using a subset of the MPI API to communicate with the core of LAIK,
suggesting that this approach may also be used in existing MPI implementations.
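
The retention idea described above can be illustrated with a minimal sketch. It is
not the implementation from this work: the class name RetainingOutbox, its methods,
and the identifier scheme (sender rank, receiver rank, per-receiver sequence number)
are assumptions chosen only for illustration. The point it shows is that identifiers
which can be regenerated deterministically, combined with an output buffer that is
never pruned, let a restarted receiver request its earlier messages again.

```python
from dataclasses import dataclass, field


@dataclass
class RetainingOutbox:
    """Hypothetical output buffer that keeps every message after delivery."""

    rank: int                                       # ID of the sending process
    _next_seq: dict = field(default_factory=dict)   # receiver -> next sequence number
    _messages: dict = field(default_factory=dict)   # message ID -> payload

    def send(self, receiver: int, payload: bytes) -> tuple:
        """Store the payload under a unique, reproducible ID and return that ID."""
        seq = self._next_seq.get(receiver, 0)
        self._next_seq[receiver] = seq + 1
        # Assumed ID scheme: (sender, receiver, sequence number).
        msg_id = (self.rank, receiver, seq)
        # The entry is deliberately never removed, even after delivery.
        self._messages[msg_id] = payload
        return msg_id

    def fetch(self, msg_id: tuple) -> bytes:
        """Serve a (possibly repeated) request for a stored message."""
        return self._messages[msg_id]

    def replay(self, receiver: int) -> list:
        """Return all payloads ever sent to `receiver`, in sending order."""
        count = self._next_seq.get(receiver, 0)
        return [self.fetch((self.rank, receiver, seq)) for seq in range(count)]
```

In this sketch, a restarted process with rank 1 would ask each peer for
replay(1) and reapply the payloads in order to rebuild its lost state. The cost of
the design is that the buffer grows with every message sent.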