There is a class of computer systems that execute requested jobs by using a plurality of processors, processor cores, or computing nodes managed as computing resources. For example, a computer system designed for high-performance computing (HPC) includes a plurality of computing nodes as the resources for execution of jobs. Also included is a managing node that manages the schedule of jobs executed on the computing nodes. This managing node performs scheduling of jobs so as to use the computing nodes in an efficient way.
The computer system noted above executes various kinds of jobs, which may be categorized into serial jobs and parallel jobs. Serial jobs are executed on a single computing resource. Parallel jobs are executed on a plurality of computing resources in a parallel fashion. In the context of parallel job processing, the term “degree of parallelism (DOP)” refers to the number of computing resources used concurrently to execute a parallel job. Different jobs may take different lengths of time to execute. Some jobs finish in a relatively short time (e.g., a few minutes to a few hours), while other jobs take a relatively long time (e.g., a few days to a few weeks).
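As a minimal illustration of these terms, the following sketch (with hypothetical names, not taken from any particular system) assigns serial and parallel jobs to computing nodes according to their DOP, where a serial job is simply a job with a DOP of one:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    dop: int  # degree of parallelism: number of nodes used concurrently


def assign(jobs, total_nodes):
    """Greedily assign each job to a contiguous block of free nodes."""
    allocation = {}
    next_free = 0
    for job in jobs:
        if next_free + job.dop > total_nodes:
            break  # not enough free nodes left for this job
        allocation[job.name] = list(range(next_free, next_free + job.dop))
        next_free += job.dop
    return allocation


# A serial job uses one node (DOP = 1); parallel jobs use several.
jobs = [Job("serial-A", 1), Job("parallel-B", 4), Job("parallel-C", 2)]
print(assign(jobs, total_nodes=8))
# {'serial-A': [0], 'parallel-B': [1, 2, 3, 4], 'parallel-C': [5, 6]}
```

Real schedulers must also account for release of nodes when jobs finish; this sketch only shows how DOP maps a job onto a set of resources.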
In view of the above aspects of jobs, job scheduling is performed for each computing resource on the basis of, for example, the types of jobs (serial or parallel), the degree of parallelism in the case of parallel jobs, and the maximum execution time of each job. One proposed scheduling system is designed to achieve a high usage ratio of at least one central processing unit (CPU). Another proposed system improves the efficiency of job scheduling, taking advantage of checkpoint and restart services. The checkpoint and restart functions enable an ongoing job to stop at a certain checkpoint and restart afterwards from that checkpoint. See, for example, the following documents:
Japanese Laid-open Patent Publication No. 2010-182199
Duell, J.; Hargrove, P.; and Roman, E., “Requirements for Linux Checkpoint/Restart,” Berkeley Lab Technical Report (publication LBNL-49659), May 2002
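The checkpoint and restart mechanism described above can be sketched in a few lines. The following toy example (all names and the pickle-based state format are illustrative assumptions, not the interface of the systems cited above) suspends a job partway through, persists its state, and later resumes it from the saved checkpoint:

```python
import os
import pickle
import tempfile


def run(job_state, steps, checkpoint_path, stop_at=None):
    """Advance a job step by step; optionally stop at a checkpoint."""
    for step in range(job_state["step"], steps):
        job_state["result"] += step          # stand-in for real computation
        job_state["step"] = step + 1
        if stop_at is not None and job_state["step"] == stop_at:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(job_state, f)    # persist progress to disk
            return "suspended"
    return "finished"


def restart(checkpoint_path):
    """Reload a suspended job's state from its checkpoint."""
    with open(checkpoint_path, "rb") as f:
        return pickle.load(f)


ckpt = os.path.join(tempfile.gettempdir(), "job.ckpt")
state = {"step": 0, "result": 0}
status = run(state, steps=10, checkpoint_path=ckpt, stop_at=4)   # "suspended"
resumed = restart(ckpt)                     # later: resume from the checkpoint
run(resumed, steps=10, checkpoint_path=ckpt)
print(resumed["result"])  # 45, the same as an uninterrupted sum(range(10))
```

The point is that the suspended job loses no work: the result after restart matches what an uninterrupted run would have produced.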
As mentioned above, jobs may be suspended in the middle of their execution for the purpose of efficient scheduling. While some users can enjoy the advantages of this job suspension, other users may suffer a loss. For example, a typical job scheduler coordinates execution of jobs in such a way that the jobs are executed in the order they are requested. This orderly job execution is no longer the case when the scheduler is allowed to use job suspension. That is, the resulting schedule may stop an earlier-arriving job to execute a later-arriving job, and when this happens, the user who requested the former job suffers a loss of time. Frequent occurrence of such swapping of execution order would lead to inequality of service among the users.
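A toy timeline (hypothetical job lengths, one computing resource) makes this loss concrete. Under in-order execution the earlier job A finishes first; when the scheduler suspends A to run the later job B, B finishes sooner but A's user waits longer:

```python
def fifo_finish_times(jobs):
    """Run jobs back to back, in arrival order, on one resource."""
    t, finish = 0, {}
    for name, length in jobs:
        t += length
        finish[name] = t
    return finish


def with_suspension(jobs, suspend_at):
    """Suspend the first job at time `suspend_at` to run the second, then resume."""
    (a, len_a), (b, len_b) = jobs
    finish_b = suspend_at + len_b                # B runs while A is suspended
    finish_a = finish_b + (len_a - suspend_at)   # A resumes from its checkpoint
    return {a: finish_a, b: finish_b}


jobs = [("A", 6), ("B", 2)]                # A is requested first, B later
print(fifo_finish_times(jobs))             # {'A': 6, 'B': 8}
print(with_suspension(jobs, suspend_at=2)) # {'A': 8, 'B': 4}
# Suspension helps B's user (8 -> 4) but costs A's user two time units (6 -> 8).
```

Repeated over many jobs, this asymmetry is exactly the inequality of service described above: the gain and the loss fall on different users.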
As can be seen from the above, the use of job suspension in scheduling is expected to improve the usage ratio of computing resources, while it could bring some loss to the users. Conventional schedulers, however, suspend jobs without sufficient consideration of the balance between the expected improvement of resource usage and the risk of loss to the users.