Scheduling parallel workloads in a High Performance Computing (HPC) cluster is an increasingly complex task, particularly with respect to the scalability and performance of the scheduling agent. Clusters are now used to solve extremely large and complicated problems, which has increased the number of nodes required to execute a single parallel job by an order of magnitude or more. As a consequence, the total number of nodes in a typical HPC cluster has grown by an order of magnitude as well.
When hundreds of compute agents running across the cluster attempt to report the status of a job to a scheduling agent running on a single node, the scheduling agent quickly becomes a performance bottleneck under the heavy communication load. In many cases, this load can also cause the scheduling agent to fail.
Most batch schedulers attempt to provide a scalable mechanism for submitting a parallel job or starting its execution, usually through an efficient one-to-many communication scheme. The reverse, many-to-one direction is where the problem arises: when a large number of compute agents running on different nodes in the cluster report job status to a scheduling agent running on a single node, the communication load can overwhelm that agent.
One solution is to serialize the processing of status reports at the scheduling agent; another is to proceed with job scheduling steps without collecting a complete report of the current job status. The former creates a performance bottleneck, while the latter delays the recognition of failures, which in turn reduces the reliability of the scheduling agent.
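The trade-off between these two approaches can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not any particular scheduler's implementation: the function names, the fixed per-report processing cost, and the time budget are all hypothetical, and time is modeled as an integer count of milliseconds rather than measured.

```python
from collections import deque


def serialized_collect(reports, cost_ms):
    """Process every status report one at a time (hypothetical model).

    Returns the collected statuses and the simulated elapsed time in ms.
    """
    queue = deque(reports)
    statuses = {}
    elapsed_ms = 0
    while queue:
        node, status = queue.popleft()
        statuses[node] = status
        elapsed_ms += cost_ms  # each report costs a fixed, assumed amount
    return statuses, elapsed_ms


def partial_collect(reports, cost_ms, budget_ms):
    """Process reports only until an assumed time budget is exhausted.

    Reports that do not fit in the budget are left unprocessed, so the
    scheduler proceeds with an incomplete view of the job status.
    """
    statuses = {}
    elapsed_ms = 0
    for node, status in reports:
        if elapsed_ms + cost_ms > budget_ms:
            break
        statuses[node] = status
        elapsed_ms += cost_ms
    return statuses, elapsed_ms


# 1000 hypothetical compute agents; the agent on node900 reports a failure.
reports = [(f"node{i}", "FAILED" if i == 900 else "OK") for i in range(1000)]

full, t_full = serialized_collect(reports, cost_ms=1)
part, t_part = partial_collect(reports, cost_ms=1, budget_ms=500)

failed_full = [n for n, s in full.items() if s == "FAILED"]
failed_part = [n for n, s in part.items() if s == "FAILED"]

print(f"serialized: {len(full)} reports in {t_full} ms, failures: {failed_full}")
print(f"partial:    {len(part)} reports in {t_part} ms, failures: {failed_part}")
```

In this model the serialized collector sees the failure on node900 but its elapsed time grows linearly with the number of reporting agents, while the budgeted collector finishes sooner but proceeds without having seen the failure at all.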