Design and Implementation of a Lightweight Communication Backend for HPC/Distributed Applications

Alexander Kurtz

Document type:: Master's thesis
Author(s):: Alexander Kurtz
Title:: Design and Implementation of a Lightweight Communication Backend for HPC/Distributed Applications
Translated title:: Entwurf und Implementierung eines schlanken Kommunikationsbackends für verteilte HPC-Anwendungen
Abstract:: LAIK is a new library which makes writing Single-Program-Multiple-Data (SPMD) programs easier by abstracting the necessary inter-process communication. To achieve this, it requires the programmer to explicitly define the data containers shared between processes and to provide a mapping between simple integer indices and the specific data elements. Since LAIK then both knows about and manages all the shared data, it can automatically distribute the data among the available workers and facilitate the necessary communication to synchronize the individual processes. For actually sending and receiving data between the collaborating processes, LAIK relies on its backend, which also provides the management information used to determine the total number of processes and the local ID among those. Originally, LAIK only had a single backend which used the MPI API to talk to various MPI implementations. Unfortunately, this meant that LAIK’s backend interface was not well tested and that a working MPI implementation was a prerequisite to run LAIK applications. In this work, we extend LAIK with a new backend using native TCP sockets provided by the operating system. We first introduce the design by presenting the main challenges to overcome and the solutions we found for them. Then we introduce the resulting implementation and evaluate it in comparison to the existing MPI backend. For this, we use a small cluster of single-board computers (SBCs) with a custom OpenMPI installation and a fast network interconnect. We demonstrate that our new TCP backend can provide comparable performance to the existing MPI backend in most cases, with a few test cases even showing significantly lower total execution times with the new backend in use. Most importantly, we find that the new TCP backend works especially well if given many send/receive operations per invocation, with a medium or large number of bytes to transmit per operation. Conversely, the performance worsens dramatically if the backend is given only a few (or just one) operation per invocation and/or very small messages to transmit. Finally, this work also showcases a way of bringing fault tolerance transparently to SPMD applications using LAIK by generating unique identifiers for each exchanged message. By simply not removing outgoing messages from the output buffer after successfully delivering them, we allow failed instances to be restarted (possibly even on a different host) and to regain their lost state by requesting the necessary messages once more. We also show that this approach to fault tolerance is possible while only using a subset of the MPI API to communicate with the core of LAIK, suggesting that this approach may also be used in existing MPI implementations.
Subject area:: DAT Data processing, computer science
DDC:: 000 Computer science, knowledge, systems
Advisor:: Yang, Dai
Reviewer:: Weidendorfer, Josef (Priv.-Doz. Dr.)
Year:: 2018
Pages/Extent:: 101
Language:: en
Language of translation:: de
University:: Technische Universität München
Faculty:: Fakultät für Informatik
Date of acceptance:: 15.05.2018
Date of presentation:: 24.05.2018
Date of publication:: 15.05.2018
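
The fault-tolerance idea described in the abstract (unique identifiers per message, plus an output buffer that keeps messages after delivery so a restarted instance can request them again) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how identifiers are built, so the `(sender, receiver, seq)` triple is an assumption, and the names `MessageId`, `RetainingOutbox`, and `recover` are hypothetical. Networking is left out entirely; only the bookkeeping is shown.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MessageId:
    """Hypothetical unique identifier: (sender, receiver, sequence number)."""
    sender: int
    receiver: int
    seq: int


@dataclass
class RetainingOutbox:
    """Output buffer that keeps every message after delivery for later replay."""
    rank: int
    _next_seq: dict = field(default_factory=dict)  # receiver -> next sequence number
    _log: dict = field(default_factory=dict)       # MessageId -> payload

    def send(self, receiver: int, payload: bytes) -> MessageId:
        """Record an outgoing message and return its identifier."""
        seq = self._next_seq.get(receiver, 0)
        self._next_seq[receiver] = seq + 1
        msg_id = MessageId(self.rank, receiver, seq)
        # Assumption: entries are never deleted, even after successful delivery.
        self._log[msg_id] = payload
        return msg_id

    def fetch(self, msg_id: MessageId) -> bytes:
        """Serve a (re-)request for a message, e.g. from a restarted peer."""
        return self._log[msg_id]


def recover(outbox: RetainingOutbox, receiver: int, count: int) -> list:
    """Re-request the first `count` messages that were addressed to `receiver`."""
    return [
        outbox.fetch(MessageId(outbox.rank, receiver, seq))
        for seq in range(count)
    ]
```

In this sketch, a restarted receiver that knows how many messages it had consumed could call `recover(outbox, receiver, count)` to obtain the same payloads in the original order. The cost is that the buffer grows without bound, a trade-off any real design would need to address.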