Scheduling jobs on parallel computer systems is challenging due to high job submission rates, high utilization of system processors and memory (i.e., scarcity of free resources at any given time), and the unpredictable arrival of jobs having various system resource requirements. In an effort to efficiently schedule jobs and optimize utilization of system resources, various job scheduling methods and systems have been developed employing backfill scheduling algorithms. “Backfill scheduling” (or backfill job scheduling) enables lower priority jobs to move forward in the schedule ahead of higher priority jobs as long as such movement does not cause any higher priority scheduled job to be delayed. In one particular backfill scheduling algorithm, known as the “EASY backfilling algorithm,” jobs may be moved ahead in the schedule as long as such movement does not delay the first queued job.
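By way of illustration only, the EASY backfilling rule described above may be sketched as follows for a set of identical nodes. This is a minimal sketch under stated assumptions and not a definitive implementation: the names `Job`, `easy_backfill`, `shadow`, and `extra` are hypothetical, each job is assumed to supply a node count and a runtime estimate, and the queue is assumed to be already sorted by priority.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # number of identical nodes requested (assumed field)
    runtime: int  # estimated runtime in arbitrary time units (assumed field)


def easy_backfill(queue, running, total_nodes, now=0):
    """Return the names of jobs that may start at time `now`.

    queue:   list of Job in priority order (highest priority first)
    running: list of (end_time, nodes) tuples for jobs already running
    """
    free = total_nodes - sum(n for _, n in running)
    running = list(running)
    waiting = list(queue)
    started = []

    # Start jobs from the head of the queue while they fit.
    while waiting and waiting[0].nodes <= free:
        job = waiting.pop(0)
        free -= job.nodes
        running.append((now + job.runtime, job.nodes))
        started.append(job.name)
    if not waiting:
        return started

    head = waiting[0]  # the first queued job, which must not be delayed
    if head.nodes > total_nodes:
        raise ValueError("first queued job can never fit on this system")

    # "Shadow time": earliest time at which enough nodes free up for the head job.
    avail = free
    shadow = now
    for end, n in sorted(running):
        avail += n
        shadow = end
        if avail >= head.nodes:
            break
    extra = avail - head.nodes  # nodes the head job will not need at shadow time

    # A later job may be backfilled if it fits in the free nodes now and either
    # finishes by the shadow time or uses only the extra nodes.
    for job in waiting[1:]:
        if job.nodes > free:
            continue
        if now + job.runtime <= shadow:
            free -= job.nodes
            started.append(job.name)
        elif job.nodes <= extra:
            free -= job.nodes
            extra -= job.nodes
            started.append(job.name)
    return started


if __name__ == "__main__":
    # 10 nodes in total; one running job holds 6 nodes until time 10.
    queue = [Job("A", 8, 5), Job("B", 3, 20), Job("C", 2, 8), Job("D", 2, 30)]
    print(easy_backfill(queue, [(10, 6)], total_nodes=10))
```

In this example the first queued job A cannot start until time 10. Job C is backfilled because it finishes before that time, and job D is backfilled because it occupies only nodes that A will not need; job B is held back because starting it would delay A.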
Backfill scheduling technology, however, is essentially limited to scheduling dedicated homogeneous nodes of a multi-node computer system or network, i.e., where all the nodes have identical capacities. This limitation effectively prevents current backfill scheduling technology from recognizing or distinguishing the capacities (e.g., CPUs per node and memory) of nodes in the scheduling set, or the differing resource requirements of jobs to be scheduled. Consequently, current backfill scheduling technology does not work properly when the node set contains nodes that do not all have identical capacities, i.e., heterogeneous nodes. When current backfill scheduling technology is used in a heterogeneous environment, these deficiencies may frequently result in: (1) erroneous priority scheduling, when the resources required by the priority job are not available, causing it to not start at the intended schedule time; (2) erroneous backfill scheduling, when the resources required by the backfill job are not available, causing it to not start at the intended schedule time and consequently delaying the start of higher priority jobs; or (3) an erroneous failure to backfill schedule, as a result of computing the start time of a higher priority job to be sooner than it could really start.
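The first two failure modes above can be illustrated with a short sketch. The names `Node`, `Request`, `fits_by_count`, and `fits_by_capacity` are hypothetical and introduced only for illustration, and the capacity-aware check is an assumed contrast case rather than any particular scheduler's method. A scheduler that treats all nodes as interchangeable merely counts free nodes, and so may conclude that a job can start when no suitable nodes exist.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpus: int
    memory_gb: int


@dataclass
class Request:
    nodes: int               # number of nodes the job needs
    cpus_per_node: int       # CPUs required on each node
    memory_gb_per_node: int  # memory required on each node


def fits_by_count(free_nodes, req):
    """Homogeneous assumption: any free node is as good as any other."""
    return len(free_nodes) >= req.nodes


def fits_by_capacity(free_nodes, req):
    """Count only the free nodes whose capacities satisfy the request."""
    usable = [
        n for n in free_nodes
        if n.cpus >= req.cpus_per_node and n.memory_gb >= req.memory_gb_per_node
    ]
    return len(usable) >= req.nodes


if __name__ == "__main__":
    free = [Node(2, 4), Node(2, 4), Node(8, 32)]
    req = Request(nodes=2, cpus_per_node=4, memory_gb_per_node=16)
    print(fits_by_count(free, req))     # count-based check accepts the job
    print(fits_by_capacity(free, req))  # only one node is actually large enough
```

Here three nodes are free, so the count-based check reports that the two-node job can be scheduled, yet only one free node has the required CPUs and memory, and the job would fail to start at its intended time.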
This problem has heretofore been addressed by avoiding heterogeneous node environments altogether (running jobs only on homogeneous node systems), or by separating heterogeneous nodes into homogeneous pools and requiring users to place their jobs into the correct pool for the job. While both methods permit user jobs to run, they do not fully utilize all of a system's resources in an efficient manner.