In recent years there has been an exponential rise in the capabilities of modern High Performance Computing (HPC) systems. This trend poses new challenges for managing node-level resources such as compute cores, memory bandwidth, and shared cache, leading to an increasing demand for effective resource management methodologies in HPC systems. As modern HPC systems are typically composed of fat compute nodes with abundant resources, it is usually difficult for a single application to fully utilize all the in-node resources. Co-scheduling, i.e., co-locating multiple jobs in a space-sharing manner, offers a promising solution for improving overall system throughput. To this end, it is crucial to allocate node resources to specific jobs based on their requirements. At the same time, co-scheduling multiple jobs further increases interference on shared resources, so isolating those resources becomes all the more important when allocating them to co-located jobs. Furthermore, node-level resources have become increasingly heterogeneous: GPU-based HPC systems are now prevalent among top supercomputers, and similar challenges apply to them as well. Considering these trends, industry has started supporting several resource partitioning and isolation features for shared resources on both modern CPUs and GPUs.
Driven by this technological trend, we focus on co-scheduling and resource partitioning on modern CPU-GPU HPC systems. Specifically, for CPUs, our target is to harmonize co-run job selection with diverse resource assignments in a NUMA-aware manner. For GPUs, we explore hierarchical resource partitioning on the latest NVIDIA GPUs, employing both finer-grained logical partitioning (MPS) and coarser-grained physical partitioning (MIG). To optimize resource management decisions, we implement a reinforcement learning-based solution, addressing the CPU and GPU optimizations separately. Experimental evaluations demonstrate that our approach can improve overall system throughput by up to 78.1% and 87.3% for CPUs and GPUs, respectively.
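The reinforcement-learning-based optimizer can be pictured, in highly simplified form, as an agent that learns which partition configuration maximizes co-run throughput. The following is a minimal, hypothetical sketch (a stateless, bandit-style Q-update; the configuration names and throughput numbers are invented for illustration and do not come from our evaluation):

```python
# Toy Q-learning loop that picks a GPU partition configuration for one job pair.
# States, actions, and rewards here are illustrative assumptions only.
import random

random.seed(0)

# Hypothetical partition choices, e.g., MIG slice splits vs. MPS sharing.
ACTIONS = ["4g+3g", "3g+4g", "7g-MPS-shared"]

# Assumed true throughput (jobs/hour) each configuration yields for this pair.
TRUE_THROUGHPUT = {"4g+3g": 1.2, "3g+4g": 0.9, "7g-MPS-shared": 1.5}

q = {a: 0.0 for a in ACTIONS}  # Q-value per action (single, stateless decision)
alpha, epsilon = 0.1, 0.2      # learning rate, exploration rate

for step in range(2000):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        a = random.choice(ACTIONS)
    else:
        a = max(q, key=q.get)
    # Reward: observed throughput plus small measurement noise.
    r = TRUE_THROUGHPUT[a] + random.gauss(0, 0.05)
    q[a] += alpha * (r - q[a])  # incremental Q-update (bandit form)

best = max(q, key=q.get)
print(best)  # the configuration the agent has learned to prefer
```

A real agent would additionally condition on a state (job mix, NUMA topology, observed interference) and act over a much larger configuration space; this sketch only shows the core update rule.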