A high performance computing (HPC) system includes a plurality of computing nodes that execute jobs and a management node that issues job execution instructions to the computing nodes. When a job execution request is entered, a scheduler in the management node determines a job execution schedule so that the job is executed in batch processing.
In job scheduling in HPC, jobs are executed in batch processing. Therefore, a job that the user wants to execute is not always executed immediately. For example, a job execution request is placed into a queue. Generally, the job that is queued first is executed first. However, the order in which jobs are executed may be changed in accordance with the degree of parallelism for each job. The degree of parallelism (DOP) for a job indicates the number of computing nodes that execute the job in parallel. For example, in some cases, an earlier-queued job with a high degree of parallelism has to wait until the number of computing nodes corresponding to its degree of parallelism becomes available, and in the meantime a later-queued job with a low degree of parallelism is executed first.
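The reordering described above can be sketched as follows. This is only a minimal illustration, not the behavior of any particular scheduler: the `Job` type, the `select_startable` function, and the rule "scan the queue in order and start every job that fits in the currently free nodes" are all assumptions introduced for this example.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    name: str
    dop: int  # degree of parallelism: number of computing nodes required


def select_startable(queue: List[Job], free_nodes: int) -> List[str]:
    """Scan the queue in submission order and start each job that fits.

    Hypothetical policy for illustration: a job whose DOP exceeds the
    number of currently free nodes keeps waiting, and later jobs may
    overtake it.
    """
    started = []
    for job in queue:
        if job.dop <= free_nodes:
            started.append(job.name)
            free_nodes -= job.dop  # these nodes are now occupied
    return started


# Job A was queued first but needs 8 nodes; only 4 are free.
queue = [Job("A", 8), Job("B", 2), Job("C", 4)]
print(select_startable(queue, free_nodes=4))  # ['B']
```

In this example, job B (DOP 2) starts ahead of the earlier-queued job A (DOP 8), while job C (DOP 4) also waits because only two nodes remain free once B has started.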
Various techniques are used in job scheduling. For example, there has been proposed a job schedule change support system that obtains a practical scheduled execution time period when changing a preregistered schedule to execute a job scenario. There has also been proposed a job scheduling method capable of reducing the risk that a job schedule fails, and capable of presenting the degree of optimization of the job schedule. There has also been proposed a parallel computing control apparatus capable of reducing the search time in a multi-job system that searches a parameter space.
See, for example, Japanese Laid-open Patent Publications No. 2010-231694, No. 2005-11023, and No. 2013-140490.
Various types of jobs are executed in HPC. Among them are jobs that are run repeatedly, such as test runs performed before the actual run. For such a job, it is important to minimize the turnaround time. Turnaround time is the time from when an execution request is entered to when the output of the execution result is completed.
The turnaround time of a job varies greatly depending on the degree of parallelism for the job. For example, if the degree of parallelism is low, it takes a long time to execute the job, but the waiting time for execution of the job tends to be short. On the other hand, if the degree of parallelism is high, it takes a short time to execute the job, but the waiting time for execution of the job tends to be long. The degree of parallelism that minimizes the turnaround time depends on the number of available computing nodes, that is, computing nodes that are not executing a job, among the computing nodes that serve as job submission destinations of the queue to which the job is submitted. For example, in the case of a queue having a sufficient number of computing nodes to execute a job waiting for execution, the turnaround time may be reduced by increasing the degree of parallelism. In the case of a queue not having a sufficient number of computing nodes to execute a job waiting for execution, the turnaround time is likely to be reduced by reducing the degree of parallelism and starting execution of the job early.
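The trade-off between waiting time and execution time can be made concrete with a small model. The sketch below is illustrative only and rests on assumptions not stated in the text: execution time scales ideally as `work / dop`, the waiting time is the moment at which `dop` nodes are simultaneously free, no other queued job competes for those nodes, and the names `turnaround`, `best_dop`, and `node_free_times` are hypothetical.

```python
from typing import List


def turnaround(dop: int, node_free_times: List[float], work: float) -> float:
    """Turnaround time = waiting time + execution time (simplified model).

    node_free_times[i] is the time at which node i becomes available
    (0 for a node that is idle now). The job can start once `dop` nodes
    are free, i.e. at the dop-th smallest free time.
    """
    wait = sorted(node_free_times)[dop - 1]
    run = work / dop  # assumes ideal linear speedup
    return wait + run


def best_dop(node_free_times: List[float], work: float) -> int:
    """Return the DOP that minimizes turnaround time under this model."""
    candidates = range(1, len(node_free_times) + 1)
    return min(candidates, key=lambda d: turnaround(d, node_free_times, work))


# Busy queue: two nodes idle now, two more free only after 100 time units.
print(best_dop([0, 0, 100, 100], work=120))  # 2
# Idle queue: all four nodes free now.
print(best_dop([0, 0, 0, 0], work=120))      # 4
```

With the busy queue, a DOP of 2 gives a turnaround of 60, whereas a DOP of 4 must wait 100 and finishes at 130. With the idle queue, the highest DOP gives the shortest turnaround of 30, which matches the tendency described above.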
In this way, the degree of parallelism appropriate for reducing the turnaround time depends on the status of jobs waiting for execution in a queue and on the resource amount of the computing nodes. However, the status of jobs in the queue changes from moment to moment. Therefore, it is difficult for the user to appropriately determine a job execution condition, such as the degree of parallelism and the queue into which a job is to be placed, so as to reduce the turnaround time.