HPC applications are usually not written to cope with dynamic changes in the execution environment, such as the removal or integration of nodes or node components. However, for greater flexibility in scheduling and fault tolerance strategies, an adequate application-integrated reaction would be worthwhile. With legacy MPI codes, this is difficult to achieve. In this paper, we present LAIK, a lightweight library for distributed index spaces and associated data containers for parallel programs supporting fault tolerance features. By giving LAIK control over data and its partitioning, the library can free compute nodes before a failure and perform replication for rollback schemes on demand. Applications thereby become more adaptive to changes in available resources. We show an example of using LAIK and present first results from a prototype implementation.