(1) Field of the Invention
The present invention relates to a job management device which instructs a plurality of calculation nodes to execute a job, a cluster system in which the job is managed by the job management device, and a computer-readable medium storing a job management program which is to be executed by a computer so as to realize the functions of the job management device. In particular, the present invention relates to a job management device in which high reliability against failure in the functions of the cluster system is ensured, a cluster system in which jobs are managed by such a job management device, and a computer-readable medium storing a job management program which realizes high reliability against failure in the functions of the cluster system.
(2) Description of the Related Art
In order to perform advanced scientific calculation, a number of computers are bundled and used as a single computer system. Hereinafter, such a computer system is referred to as a cluster system, and each computer constituting the cluster system is referred to as a node. Each user inputs a request for calculation to the cluster system from another computer. Hereinafter, processing executed in response to such a request is referred to as a job. The job may be a parallel job or a sequential job. A parallel job is executed by a plurality of nodes, and a sequential job is executed by a single process on a single node.
When the cluster system receives a job, the cluster system is required to make one or more nodes execute the job. Therefore, a node which assigns the job to one or more other nodes and manages the operational status of the job at each node is provided. Hereinafter, such a node is referred to as a management node, and the nodes other than the management node are referred to as calculation nodes.
In the cluster system, the management node is aware of the assignment of jobs to the calculation nodes and the operational status of the calculation nodes. Therefore, when the management node fails, information on the assignment of jobs to the calculation nodes and the operational status of the calculation nodes will be lost.
In the case where the cluster system does not have a function of ensuring high reliability against failure in the management node (which is hereinafter referred to as the reliability-ensuring function), even the information on the existence of the job is lost when the management node fails. In this case, it is necessary to re-input the job into the cluster system, and complete jobs currently executed in the calculation nodes before the re-input of the job. However, if the reliability-ensuring function is not provided, there is no way to know the jobs which the cluster system has received before the failure, so that it is necessary to restart all the calculation nodes.
Consequently, the reliability-ensuring function is used. In the conventional reliability-ensuring function, information on the status of the jobs is stored in one or more files in a job database (DB), which is constructed on a hard disk drive (HDD). Even when the management node fails, it is possible to recognize the currently executed jobs by reading the one or more files from the job database on startup of the management node.
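The conventional arrangement described above can be illustrated with a minimal sketch. The JSON file format, the function names save_job_db and load_job_db, and the record layout are assumptions made purely for illustration; the related art does not specify how the job database is organized.

```python
import json
import os

def save_job_db(path, jobs):
    """Store job status records (job ID -> status information) in a file on disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(jobs, f)

def load_job_db(path):
    """Read the job status records back, as a management node might do on startup."""
    if not os.path.exists(path):
        return {}  # nothing was recorded before the failure
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

In this sketch, a restarted management node would call load_job_db to recognize the jobs that had been recorded before the failure.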
Incidentally, there are two job states, the job-input state and the job-execution state. The cluster system receives a request from another computer in the job-input state, and the job is assigned to and executed by one or more calculation nodes in the job-execution state.
If only the job-input state is managed in the job database, it is possible to recognize each job which should be executed, even after the management node fails. Therefore, the re-input of the job by the user is unnecessary. However, in this case, the operational status of each job is unknown, so that it is necessary to first stop the jobs currently executed by the calculation nodes, and then restart the jobs from the beginning. Therefore, the operational efficiency of the system decreases. In particular, scientific calculation involves jobs which take a long time to execute, so that re-execution of such jobs from the beginning is extremely inefficient.
Thus, conventionally, both the job-input state and the job-execution state are managed in a file in the job database constructed on a hard disk drive. In this case, it is possible to recognize the job-execution state, as well as the job-input state, even after the management node fails. Therefore, it is unnecessary to re-input each job into the cluster system, and it is possible to continue the jobs currently executed by one or more calculation nodes as they are. In addition, even when the management node fails, it is unnecessary to execute each job from the beginning. However, when the job-execution state is stored, delay occurs in reflecting the information in the hard disk drive, as explained below.
In an ordinary operating system (OS), even when writing of information to a file is instructed, the information is at first written only to memory in order to increase the operational speed of the system. Thereafter, the updated information in the file is actually written to the disk at predetermined times. That is, the information written to memory is not immediately reflected in the disk, so that delay can occur at any time. Thus, conventionally, the job-execution state in the disk is not necessarily up to date, and the ensuring of the reliability is impeded.
In order to solve the problem of the delay caused by the writing operation, it is conceivable to reflect the information written to memory in the file on the disk at the same time as the writing to memory. However, in this case, since writing to the hard disk drive is far slower than writing to memory, the processing in the system is delayed.
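The trade-off between the two writing modes can be sketched as follows. This is only an illustration under stated assumptions: the function names write_buffered and write_synchronous and the one-record-per-line format are hypothetical, while os.fsync is the standard Python call that asks the operating system to write a file's buffered data to the storage device.

```python
import os

def write_buffered(path, record):
    """Append a record and return once the OS has accepted it.

    Closing the file hands the data to the operating system, which may keep it
    in memory and write it to disk later. The record can therefore be lost if
    the node fails before that write takes place.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")

def write_synchronous(path, record):
    """Append a record and wait until the OS reports it written to the device.

    This narrows the window in which the record can be lost, but each call
    waits for the much slower disk write.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        f.flush()              # move data from the Python buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write its buffered data to disk
```

Both functions leave the same content in the file; they differ only in when the data is forced to the disk, which is the source of the delay on the one hand and of the slowdown on the other.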
In order to solve the above problem, a technique for ensuring the reliability without using the job database has been considered. For example, it is possible to store information on the job-execution state and the environmental settings (which are held by the management node) in the calculation nodes as well as in the management node. When the management node fails, one of the calculation nodes is promoted to a management node, so that the calculation node has both the functions of a calculation node and the functions of the management node. Then, the promoted calculation node collects the job-execution states from the other calculation nodes and executes the functions of the management node. (See, for example, Japanese Unexamined Patent Publication No. 6-96041, which is hereinafter referred to as JPP6-96041.)
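The promotion and collection steps described above can be sketched roughly as follows. The Node class, the function promote_and_collect, and the rule of promoting the first surviving node are all assumptions introduced for illustration; they are not taken from JPP6-96041.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    job_states: dict = field(default_factory=dict)  # job ID -> state held by this node
    is_manager: bool = False
    alive: bool = True

def promote_and_collect(calculation_nodes):
    """Promote one surviving calculation node and gather job states from all survivors."""
    survivors = [n for n in calculation_nodes if n.alive]
    if not survivors:
        raise RuntimeError("no surviving calculation node to promote")
    new_manager = survivors[0]       # hypothetical selection rule
    new_manager.is_manager = True    # the node now has both roles
    collected = {}
    for node in survivors:           # every surviving node must be accessed
        collected[node.name] = dict(node.job_states)
    return new_manager, collected
```

In this sketch, job states held only by a failed node do not appear in the collected result, and the collection loop has to access every surviving node.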
However, the technique for ensuring the reliability disclosed in JPP6-96041 has the following problems (i) to (v).
(i) JPP6-96041 discloses no way of coping with parallel jobs. In the case of a parallel job, it is necessary that a plurality of calculation nodes execute the parallel job in cooperation with each other. In order to realize the cooperation, one of the plurality of calculation nodes which execute the parallel job becomes a job master, and manages details of the job-execution state. In this case, it is difficult to cope with the parallel job unless information unique to the parallel job (such as the information indicating which one of the calculation nodes is the job master) can be restored.
(ii) It is necessary that the calculation nodes store all the information on the parallel job (i.e., all the information necessary for execution of the parallel job). Therefore, in some cases, a great amount of information is required to be transmitted.
(iii) The technique disclosed in JPP6-96041 cannot cope with double failure. That is, when the management node and one or more of the calculation nodes concurrently fail, information indicating the existence of one or more jobs executed by the one or more failed calculation nodes is lost, so that external re-input of the one or more jobs is required. Therefore, it is necessary to externally check whether or not information on jobs which have already been inputted is lost. In other words, unless it is externally managed whether or not the information on the already-inputted jobs has been lost, it is impossible to reproduce the lost information on the jobs, so that the reliability of the cluster system deteriorates.
(iv) In order to collect necessary information, access to all the calculation nodes is required. Therefore, processing for the communication becomes inefficient. In particular, in the systems used in scientific calculation, the number of calculation nodes tends to be large. Thus, when information for managing all the calculation nodes is transferred to a calculation node, the amount of transferred data becomes extremely large.
(v) According to the technique disclosed in JPP6-96041, an attempt to protect information is made by using only the calculation nodes. Therefore, when one or more calculation nodes fail, it is impossible to re-execute one or more jobs which have been executed by the one or more calculation nodes before the failure.
If all or some of the above problems are solved, it is possible to prevent a decrease in processing efficiency while ensuring the reliability of the cluster system.