The ability to efficiently schedule jobs in a parallel processing environment is an important aspect of high-performance computing systems. In general, these jobs can include batch jobs and/or dedicated jobs. A batch job is one that does not have a user-specified start time and can be scheduled by a scheduler at some optimal time, depending on the scheduling protocol. A dedicated job is one having a user-requested start time that is fixed and not decided by a scheduler. Thus, unlike batch jobs, dedicated jobs are rigid in their start times and must commence at the user-requested start time.
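The distinction between the two job types can be sketched as a minimal data structure. This is a hypothetical representation for illustration only; the field names (`duration`, `start_time`, etc.) are assumptions, not taken from any particular scheduler:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    """A job submitted to the scheduler (illustrative sketch).

    A batch job leaves start_time as None, so the scheduler is free to
    choose when it runs; a dedicated job fixes start_time, which the
    scheduler must honor exactly.
    """
    job_id: str
    duration: int                     # user-estimated execution time (minutes)
    nodes: int                        # number of compute nodes requested
    start_time: Optional[int] = None  # fixed start for dedicated jobs

    @property
    def is_dedicated(self) -> bool:
        return self.start_time is not None

batch = Job("sim-001", duration=120, nodes=4)                  # flexible
dedicated = Job("traffic-feed", duration=60, nodes=8, start_time=540)  # rigid
```

Encoding the user-requested start time as an optional field makes the flexibility of batch jobs explicit: the scheduler may place a job anywhere on the timeline precisely when that field is absent.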
For homogeneous workloads comprising batch jobs only, the efficiency of a parallel processing computing system depends on how tightly the batch jobs can be packed so as to maximize system utilization while minimizing job wait times. At a high level, HPC (high-performance computing) systems have generally used a queuing model to schedule incoming jobs, wherein most optimizations revolve around how an HPC system is packed and how the queue is managed to maximize system utilization while minimizing job wait times. Much of the complexity involves balancing the expected runtime needs of a given job against the scheduling of future jobs. Unpredictable wait times are a key issue in batch schedulers. For certain workloads, this unpredictability can be tolerated. For other workloads, however, such as real-time workloads, better guarantees are required.
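The queue-packing model described above can be made concrete with a small placement routine. The following is a conservative-backfilling-style sketch under simplifying assumptions (the function names and the event-point scan are illustrative): each queued batch job is placed at the earliest time at which enough nodes remain free for its entire user-estimated duration.

```python
def free_nodes(placed, total_nodes, t0, t1):
    """Minimum free capacity over the window [t0, t1)."""
    # capacity only changes at job start times, so checking t0 and every
    # placed start that falls strictly inside the window is sufficient
    points = {t0} | {s for s, _, _ in placed if t0 < s < t1}
    return min(total_nodes - sum(n for s, e, n in placed if s <= p < e)
               for p in points)

def pack_batch_jobs(jobs, total_nodes):
    """Place batch jobs in submission order at their earliest feasible
    start time (a conservative-backfilling-style sketch).

    jobs: list of (job_id, duration, nodes); returns {job_id: start}.
    """
    placed, schedule = [], {}
    for job_id, duration, nodes in jobs:
        # candidate starts: time zero, or when an already-placed job ends
        for t in sorted({0} | {e for _, e, _ in placed}):
            if free_nodes(placed, total_nodes, t, t + duration) >= nodes:
                placed.append((t, t + duration, nodes))
                schedule[job_id] = t
                break
    return schedule

# on a 4-node system, the small job C backfills alongside job A
print(pack_batch_jobs([("A", 2, 3), ("B", 2, 2), ("C", 1, 1)], 4))
# → {'A': 0, 'B': 2, 'C': 0}
```

Note how job C starts immediately even though it was submitted after B: it fits in the capacity left over while A runs, without delaying B. This backfilling behavior is exactly the "tight packing" that drives utilization in the homogeneous batch case.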
For example, for heterogeneous workloads comprising batch jobs and dedicated jobs, additional complexity arises because the process of scheduling flexible batch jobs around rigid dedicated jobs is non-trivial. Many scenarios in a parallel processing environment can be envisaged where some users need to run background simulation programs that are not time or deadline critical, while other users may require rigid and fixed time slots to execute jobs such as real-time traffic data processing during certain periods of the day or week, or real-time geographical, satellite, or sensor data processing during certain periods of the month or year. In such cases, a single HPC scheduler must be capable of efficiently scheduling a heterogeneous workload of batch and dedicated jobs. State-of-the-art HPC schedulers are designed for handling only batch jobs and are incapable of efficiently handling such heterogeneous workloads through a systematic and optimal methodology.
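The core difficulty of the heterogeneous case is that dedicated jobs carve fixed windows out of the timeline, and batch jobs must flow around them. A minimal sketch, assuming (for simplicity) whole-machine dedicated reservations and a single batch job to place, is:

```python
def earliest_batch_start(duration, reservations, now=0):
    """Place a flexible batch job around rigid dedicated windows (sketch).

    reservations: list of (start, end) intervals during which the machine
    is reserved for dedicated jobs (whole-machine, for simplicity).
    Returns the earliest start >= now at which a batch job of the given
    duration fits entirely between reservations.
    """
    t = now
    for start, end in sorted(reservations):
        if t + duration <= start:  # the job fits in the gap before this window
            return t
        t = max(t, end)            # otherwise skip past the reservation
    return t                       # it fits after the last reservation

# dedicated windows at [5, 8) and [10, 12): a 3-unit batch job fits at t=0,
# but a 4-unit job submitted at t=3 must wait until both windows have passed
print(earliest_batch_start(3, [(5, 8), (10, 12)]))         # → 0
print(earliest_batch_start(4, [(5, 8), (10, 12)], now=3))  # → 12
```

Even in this simplified single-resource form, the batch job's placement depends on every dedicated window ahead of it; with multiple nodes and many batch jobs, the interaction becomes the non-trivial packing problem the paragraph above describes.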
Furthermore, state-of-the-art HPC schedulers for a parallel processing environment are generally optimized for submit-time elasticity of batch jobs only, where resource needs (e.g., user-estimated job execution times) are specified only at submission time. Once batch jobs with user-estimated execution times are submitted, those estimates cannot be explicitly altered at runtime. Current HPC scheduling algorithms account for both scheduled termination (kill-by time) and premature termination before the user-estimated end time, but do not account for the interplay of explicit, on-the-fly extensions or reductions in execution time between batch and dedicated jobs. In other words, state-of-the-art HPC schedulers are not designed for runtime elasticity of heterogeneous workloads, wherein runtime elasticity allows a user to change the execution time requirements (or other resource requirements) for a given job during execution of that job. Adding a runtime elasticity capability to a scheduling protocol, where jobs can expand and contract their execution times on the fly, leads to even further complexity with regard to implementing an efficient scheduling algorithm to accommodate that capability.
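The runtime-elasticity interplay can be illustrated with a simple admission check. This is a sketch under assumptions (the function name and the single-next-reservation model are hypothetical, not any particular scheduler's algorithm): a running job asks to change its execution time on the fly, and the scheduler grants at most the slack remaining before the next dedicated reservation, so that a rigid dedicated job is never delayed.

```python
def grant_extension(job_end, requested_extra, next_reserved_start):
    """Runtime-elasticity admission check (illustrative sketch).

    A running job asks to extend its execution time by requested_extra.
    The grant is capped at the slack before the next dedicated
    reservation on the same nodes; a reduction (negative extra) is
    always safe and frees capacity for other jobs.
    Returns the job's new end time.
    """
    if requested_extra <= 0:                        # shrinking always fits
        return job_end + requested_extra
    slack = max(0, next_reserved_start - job_end)   # room before the rigid job
    return job_end + min(requested_extra, slack)

print(grant_extension(100, 30, 200))  # → 130 (full extension granted)
print(grant_extension(100, 30, 110))  # → 110 (partial grant up to the slack)
print(grant_extension(100, 30, 90))   # → 100 (no room; extension denied)
```

Even this toy check hints at the added complexity: every extension request must be validated against future commitments, and every reduction creates a hole that the scheduler may want to re-pack with queued batch jobs.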