1. Field
The embodiments discussed herein are directed to management of execution of a job.
2. Description of the Related Art
In order to perform an enormous amount of computation, a method of executing a job including a plurality of processing units is used in computers (large general-purpose computers, in particular). The processing time of the job varies from a few hours to several weeks.
Accordingly, it is sometimes necessary to suspend and later restart execution of the job (that is, to perform the checkpointing and restart of the job). The suspended job has to be restarted without fail. It is therefore important to manage and control the suspended job.
As a method of suspending execution of a job in a large general-purpose computer, for example, a job suspending method of allowing a virtual machine data processing system to perform the checkpointing and restart of a single job using a signal has been proposed.
Currently, information processing systems called grid computing systems, in which computer apparatuses connected to a network cooperate with each other, are becoming increasingly popular.
In such a grid computing system, the load of an enormous amount of computation is distributed to computer apparatuses so as to cause the computer apparatuses to cooperate to perform the computation. Accordingly, as compared with large general-purpose computers in the related art, grid computing systems can perform computation processing at lower cost and in a shorter time.
However, each computer apparatus included in a grid computing system is preferentially used by its own user. Accordingly, while a computer apparatus is being used by its user, execution of a grid job on that computer apparatus has to be suspended. Furthermore, the suspended job has to be managed so that it can be restarted.
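The requirement described above, suspending a job while the owner of the computer apparatus is using it and keeping track of the suspended job so that it can be restarted, may be illustrated by the following minimal sketch. It is not taken from any of the related-art proposals; the names `Job`, `JobState`, `SuspendedJobRegistry`, and `on_owner_activity` are hypothetical and are introduced only for illustration, and the sketch models bookkeeping only, not an actual checkpointing mechanism.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


@dataclass
class Job:
    # Hypothetical job record; a real system would also hold checkpoint data.
    job_id: str
    state: JobState = JobState.RUNNING


@dataclass
class SuspendedJobRegistry:
    """Hypothetical manager-side record of jobs that await restart."""
    suspended: dict = field(default_factory=dict)

    def on_owner_activity(self, job: Job, owner_active: bool) -> None:
        # The owner has priority: suspend a running job when the owner is active.
        if owner_active and job.state is JobState.RUNNING:
            job.state = JobState.SUSPENDED
            self.suspended[job.job_id] = job
        # Restart a suspended job once the owner is no longer using the machine.
        elif not owner_active and job.state is JobState.SUSPENDED:
            job.state = JobState.RUNNING
            self.suspended.pop(job.job_id, None)
```

In this sketch the registry is the only place where suspended jobs are recorded, which reflects the point that a suspended job must remain known to some manager if it is to be restarted without fail.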
For example, a dynamic service registry for a virtual machine has been proposed in which, when a virtual machine on a computer apparatus for executing a job instructs the checkpointing or restart of a job, a user of the computer apparatus serving as a computer resource notifies a management apparatus for a grid computing system of the instruction.
A topology aware grid services scheduler architecture for managing control of the checkpointing and restart of an online application using a Web service in a grid computing system has been proposed.
However, in the related art, there are the following problems. If a computation resource is operated by a specific OS (Operating System) such as an open-source OS, it is possible to manage and control the checkpointing and restart of a job.
However, if a computation resource is operated by a non-open-source OS, it is impossible to manage and control the checkpointing and restart of a job. Even if a computation resource is operated by an open-source OS, a specific library has to be installed. In this case, if the source code of a job is not available, checkpointing of the job cannot be performed.
It is an object of the present invention to effectively perform the checkpointing and restart of a job in a grid computing system without using an OS of each computer apparatus serving as a computer resource.