The present invention is generally directed to the problem of scheduling jobs to run in a parallel batch data processing system. More particularly, the present invention is directed to a system and method for ensuring the early starting of a job on a system component that is best able to handle it. Even more particularly, the present invention is directed to the use of such methods in data processing systems which include a mix of older and newer hardware components.
Parallel batch job schedulers for High Performance Computing (HPC) machines are well known. Some current examples include IBM LoadLeveler, Sun GridEngine, Platform LSF and openPBS. In order to control the allocation of resources, individual computing nodes are grouped into job classes (also known as queues in some implementations). Note that a node may belong to more than one job class at a time. Using this technique, resources are segregated for whatever reasons a system administrator desires. Often, HPC users upgrade systems yet retain older hardware in the same system. This results in a mix of nodes with different capabilities. If these different nodes are included in the same job class, a single job may be assigned a mix of node types, which often results in an overall job slowdown since parallel jobs tend to run only as fast as the slowest resource. It is natural, then, to segregate these different technologies into different job classes, for example, old and new. Even after segregation, many jobs are able to run in either job class.
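The slowdown caused by mixed nodes can be illustrated with a minimal sketch. It assumes a deliberately simple model in which work is divided equally among the nodes and the job completes only when the slowest node finishes; the function name, the relative speeds and the work amount are hypothetical values chosen for illustration.

```python
def parallel_runtime(work_per_node, node_speeds):
    """Runtime of a parallel job under a simplified model (an assumption):
    every node receives the same amount of work, and the job finishes
    only when the slowest node has finished its share."""
    return max(work_per_node / speed for speed in node_speeds)


# Hypothetical relative node speeds: 2.0 for newer nodes, 1.0 for older ones.
new_nodes = [2.0, 2.0, 2.0, 2.0]
mixed_nodes = [2.0, 2.0, 2.0, 1.0]  # a single older node in the allocation

runtime_new = parallel_runtime(100.0, new_nodes)
runtime_mixed = parallel_runtime(100.0, mixed_nodes)
```

Under this model a single older node doubles the runtime of the whole job, which is the motivation for placing old and new nodes in separate job classes.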
However, at job submission time, the user is not able to predict which job class will provide free resources at the earliest time. If a poor selection is made, the job may wait for resources of one class to become free while the nodes of the alternate class sit idle. Predictive techniques are difficult to apply since dynamic changes in the job queue, affecting both classes and jobs, can occur at random times. Examples of such random changes include jobs completing early, additional user jobs entering the job queue, and jobs being deleted from the queue by users. The problem then is to provide a utility which delivers free resources from a set of disjoint job classes to an idle job on the job queue, with the intention of obtaining resources for the job as early as possible.
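The situation described above can be made concrete with a small sketch. It only illustrates the problem statement and is not the method of the invention; the class names "old" and "new", the snapshot of free node counts and the helper function are hypothetical.

```python
from typing import Dict, List, Optional


def first_class_with_free_nodes(
    nodes_needed: int,
    free_nodes: Dict[str, int],
    eligible_classes: List[str],
) -> Optional[str]:
    """Return the first eligible job class that currently has enough free
    nodes for the job, or None if the job must keep waiting."""
    for job_class in eligible_classes:
        if free_nodes.get(job_class, 0) >= nodes_needed:
            return job_class
    return None


# Hypothetical snapshot: the "new" class is fully busy, the "old" class is idle.
free_nodes = {"old": 8, "new": 0}

# A job pinned to "new" at submission time waits although "old" nodes are free.
pinned = first_class_with_free_nodes(4, free_nodes, ["new"])

# The same job, if allowed to use either class, could start immediately.
flexible = first_class_with_free_nodes(4, free_nodes, ["new", "old"])
```

Here `pinned` is `None` while `flexible` is `"old"`. Because such a snapshot can change at any moment, a choice fixed at submission time may turn out to be a poor one.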