Modern distributed-memory HPC systems consist of thousands of compute nodes, comprising millions of cores, interconnected via a high-performance network. These systems are shared among thousands of users who submit and execute their applications on them. The software component responsible for scheduling jobs and allocating resources to them is a middleware called the Resource and Job Management System (RJMS); its central element for resource management and scheduling is the batch scheduler. The submitted applications are typically large-scale simulations from domains such as astronomy, climate modeling, biology, and fluid and molecular dynamics. Moreover, the resource requirements of these applications can change dynamically during their runtime. Changing the resources of an application at runtime requires both an adaptive parallel runtime system and a dynamic resource management system, yet no current RJMS supports the dynamic reconfiguration of running applications.

To address this, in this thesis the scalable workload manager SLURM is extended to support the dynamic reconfiguration of resource-elastic applications written using the Invasive MPI adaptive library. Several SLURM binaries are extended so that users can submit resource-elastic jobs in batch mode (see the sketch below). The batch scheduler in SLURM is extended through a scheduling plugin to support the efficient combined scheduling of rigid and malleable applications. Moreover, multiple scheduling strategies for elastic applications are implemented and evaluated. Finally, the overhead of the dynamic adaptation operations, i.e., expansion and reduction, is analyzed.
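As an illustration, the following is a minimal sketch of how such a resource-elastic job might be submitted in batch mode. It uses only standard SLURM batch-script syntax: the --nodes=<min>-<max> range is an existing SLURM option for requesting a node-count range, and the idea that the elastic extension may grow or shrink the running job within this range is an assumption made here for illustration. Any additional options introduced by the extended SLURM binaries, and the application name elastic_app, are placeholders, not part of stock SLURM.

    #!/bin/bash
    # Minimal, hypothetical batch script for a resource-elastic job.
    # --nodes=<min>-<max> is standard SLURM syntax for a node-count range;
    # the assumption here is that the elastic extension may expand or
    # reduce the running job within this range. Options specific to the
    # extended binaries are intentionally not shown.
    #SBATCH --job-name=elastic_sim
    #SBATCH --nodes=2-8            # at least 2 nodes, at most 8 nodes
    #SBATCH --time=01:00:00

    # elastic_app stands in for an application written with the
    # Invasive MPI adaptive library.
    srun ./elastic_app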