1. Field of the Invention
The present invention relates to a job control device that is provided with a control configuration file reading part for reading a control configuration file and that controls the submitting of jobs to a job submit system based on the descriptive content of the control configuration file so read. The invention also relates to such a control configuration file, a job control method and a job control program, which have been developed in particular for the purpose of reducing the time and effort a user spends on management when a batch of jobs is submitted.
2. Description of the Related Art
A conceptual diagram of a known job control device to which jobs are submitted is shown in FIG. 10. In this known device, a user submits jobs to a queue of the device together with a script or a description file for submitting the jobs, and job IDs are then returned to the user. The user confirms the completion of each job by e-mail or the like, or checks the status of job processing by designating job IDs. After a job ends, its result is retrieved from a file. Known batch job systems of this kind include LSF of Platform Computing Inc., OpenPBS, and Condor of the High Throughput Computing Team of the University of Wisconsin.
In any of the above-mentioned systems, a user can designate an option to receive an e-mail each time a job ends or is completed, so that he or she can confirm the completion of that job. However, only a very small number of systems permit a user to confirm completion after all of a plurality of jobs have been completed.
Although a plurality of jobs can be submitted from one job control file in the above-mentioned “Condor”, only the names of the programs to be executed and their parameters can be designated. In cases where the programs to be executed are written as sh scripts or perl scripts, it is necessary to manage the script files and the job control file separately. Besides, the only available way of giving notice that all of a group of jobs submitted at one time have been completed is to send an e-mail; it is impossible to start a general program or to write data or the like into a file.
In the systems other than the “Condor”, only a single job can be submitted at a time. Hence, in order to submit a lot of jobs mechanically or automatically, it is necessary to prepare separate programs, such as sh scripts or perl scripts, from which the jobs can be submitted. As a result, the scripts for submitting the jobs and the scripts for performing the jobs must be managed separately. Moreover, the scripts for submitting the jobs must be corrected each time jobs are submitted, so mistakes are easily made. Even with the “Condor”, only a sequence of jobs to be submitted can be described, and when it is desired to submit similar jobs repeatedly, a complicated procedure is required in which each job is submitted after a job control file has been created automatically by a separate program.
Further, some methods have been provided for executing jobs while awaiting the completion of other jobs. For example, in “PBS”, all jobs are first submitted, and the order or sequence of the jobs is then designated separately by a command or commands with respect to the job IDs obtained at the time of submitting the jobs. In this method, however, the job IDs cannot be known until the jobs are submitted. A procedure is therefore required in which the jobs to be executed later are first submitted in an execution-hold state, their order or sequence is then designated by using their IDs, and the hold state is finally released. Such a procedure is very complicated and cumbersome.
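By way of illustration only, the three-step hold, designate and release procedure described above can be sketched in Python as follows. The class ToyQueue and all of its methods are hypothetical, in-memory stand-ins introduced for this sketch; they are not the interface of PBS or of any other actual batch system.

```python
class ToyQueue:
    """Hypothetical in-memory stand-in for a batch queue (not a real PBS API)."""

    def __init__(self):
        self._next_id = 1
        self.jobs = {}  # job ID -> state of that job

    def submit(self, name, hold=False):
        # The job ID only becomes known here, at submission time.
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {"name": name, "held": hold,
                             "after": set(), "done": False}
        return job_id

    def set_dependency(self, job_id, after_ids):
        # Designate that job_id may run only after all of after_ids are done.
        self.jobs[job_id]["after"] = set(after_ids)

    def release(self, job_id):
        # Clear the execution-hold state.
        self.jobs[job_id]["held"] = False

    def finish(self, job_id):
        self.jobs[job_id]["done"] = True

    def runnable(self):
        # Jobs that are not done, not held, and whose predecessors are all done.
        return sorted(
            job_id for job_id, job in self.jobs.items()
            if not job["done"] and not job["held"]
            and all(self.jobs[d]["done"] for d in job["after"])
        )


def submit_with_ordering(queue):
    first = queue.submit("simulate")
    # Step 1: submit the later job in a hold state so that it cannot start.
    second = queue.submit("summarize", hold=True)
    # Step 2: now that both IDs are known, designate the sequence.
    queue.set_dependency(second, [first])
    # Step 3: release the hold; the dependency alone now delays the job.
    queue.release(second)
    return first, second
```

Even in this toy form, three separate operations are needed for each dependent job, which indicates why the procedure becomes burdensome as the number of jobs grows.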
Furthermore, in the “Condor”, the sequence of execution between the jobs can be designated as a DAG (Directed Acyclic Graph), but this designation is described in a separate file in a format independent of the ordinary job control file, and it is executed by using a different group of commands. That is, if the file for the “DAG” and the job control file are both executed by scripts, it is necessary to manage three kinds of script files in mutually different formats at the same time. In addition, only jobs can be described in such a sequence, so even simple work such as the accumulation or tabulation of files, which requires a very short execution time, has to be submitted as a job and must therefore wait in a queue for a machine to become available.
In the past, the number of jobs submitted at a time was relatively small, e.g., from a few to tens, so it was easy for a user to wait to be informed of the completion of the jobs by e-mail or the like. Additionally, if it became necessary to submit a job after a certain job had been completed, the user had to submit that job manually at that time. At present, however, a lot of computers are connected to one another through networks by the use of GRID technology or the like, so hundreds of computers are available. As a result, the number of jobs submitted by a user at a time has also become huge, e.g., from hundreds to thousands. In this case, it is necessary to prepare scripts or control files for submitting jobs manually or by executing a separate program, and hence there arises a problem in that much time and effort are required for their management and various errors easily occur.
Still further, when the number of jobs submitted by a user at a time becomes huge, such as hundreds, thousands or more, there arises another problem in that even if the completion of each job is notified by e-mail, the user cannot properly manage the notifications because there are too many e-mails.
In addition, when jobs are executed by the use of a lot of computers, they are executed concurrently and in parallel by the mutually connected computers, so the jobs are completed in random order. Thus, there arises the following problem. When another work is to be performed after groups of necessary jobs have been completed, i.e., when the results of groups of jobs are to be arranged, collected or coordinated, or when another work is to be done by utilizing the job results, it is difficult to check automatically whether all the necessary jobs have actually been completed. It therefore becomes necessary to carry out a manual check at regular intervals, which is very time consuming.
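For illustration, the kind of automatic check whose absence is described above can be sketched in Python with the standard concurrent.futures module, in which a thread pool stands in for the connected computers. The functions run_job and run_batch_then_collect are hypothetical names assumed for this sketch and do not represent any of the systems mentioned.

```python
import random
import time
from concurrent.futures import ALL_COMPLETED, ThreadPoolExecutor, wait


def run_job(job_number):
    # Hypothetical job: sleeps a random time so jobs finish in random order.
    time.sleep(random.uniform(0.0, 0.05))
    return job_number * job_number


def run_batch_then_collect(job_numbers):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit every job and remember which future belongs to which job.
        futures = {n: pool.submit(run_job, n) for n in job_numbers}
        # Block until all of the jobs are done, whatever order they end in.
        done, not_done = wait(futures.values(), return_when=ALL_COMPLETED)
        assert not not_done
        # The follow-up work (collecting the results) runs only at this point.
        return {n: future.result() for n, future in futures.items()}
```

The follow-up work is thus performed once, after the whole group has been completed, without any manual check at regular intervals.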
Moreover, there is a problem in that it is also difficult to have another work performed automatically at the end of certain jobs. Further, when the number of jobs to be waited for is irregular or variable, i.e., when the number of jobs to be awaited changes according to the submitting of a job or jobs, there arises an additional problem in that it is very difficult to describe the waiting for jobs in an appropriate manner. Also, there might be a case where a user wants to determine whether to submit another job or jobs based on data that are output by the jobs of concern. For example, various jobs are tried so that the result output thereby provides an optimal value or values. In such a case, however, there is the following problem. It is difficult to await the completion of the jobs, then execute a suitable program to determine whether the result obtained by each trial provides an optimal value or values, and further submit a job or jobs as a result of such determination.
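As a minimal sketch of the situation described above, in which the output of completed jobs determines which jobs are submitted next, the following Python code repeatedly submits two trial jobs, awaits both, and chooses the next trials from their results. The cost function in run_job and the search strategy are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(x):
    # Hypothetical job: returns a cost that is smallest at x = 3.0.
    return (x - 3.0) ** 2


def search(center, step, rounds):
    best_x, best_cost = center, run_job(center)
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(rounds):
            # Submit one trial job on each side of the best value so far.
            candidates = [best_x - step, best_x + step]
            # pool.map returns only after both trial jobs have been completed.
            costs = list(pool.map(run_job, candidates))
            cost, x = min(zip(costs, candidates))
            if cost >= best_cost:
                # No improvement: the next round tries closer values.
                step /= 2
            else:
                # Improvement: the next jobs are centered on the new value.
                best_x, best_cost = x, cost
    return best_x, best_cost
```

For example, search(0.0, 1.0, 10) starts from 0.0 and moves toward the minimum of the assumed cost function, each round of jobs being decided only after the previous round has ended.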
Furthermore, when a user wants to wait for jobs, he or she sometimes wants to wait for a plurality of groups of jobs which are very similar to one another. For example, there may be a plurality of works in which a first one must be done after waiting for all the jobs whose parameter A is 1, a second one must be done after waiting for all the jobs whose parameter A is 2, . . . , and the last one must be done after waiting for all the jobs whose parameter A is 10. In such a case, there is a problem in that describing the respective waitings for all the works ten times is very troublesome.
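The example above, in which one work follows each group of jobs sharing the same value of parameter A, can be sketched in Python as follows. A single loop over completed jobs replaces ten separately described waitings. The names run_job and run_grouped, the second parameter b, and the use of a sum as the per-group work are all hypothetical choices made for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(a, b):
    # Hypothetical job whose result depends on parameters A and B.
    return a * 100 + b


def run_grouped(a_values, b_values):
    summaries = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        group_of = {}               # future -> value of parameter A
        pending = defaultdict(int)  # value of A -> number of unfinished jobs
        results = defaultdict(list)
        for a in a_values:
            for b in b_values:
                future = pool.submit(run_job, a, b)
                group_of[future] = a
                pending[a] += 1
        # Jobs finish in arbitrary order; count down each group separately.
        for future in as_completed(group_of):
            a = group_of[future]
            results[a].append(future.result())
            pending[a] -= 1
            if pending[a] == 0:
                # All jobs with this value of A are done: do the group's work.
                summaries[a] = sum(results[a])
    return summaries
```

Calling run_grouped(range(1, 11), range(3)) performs the per-group work ten times, once for each value of parameter A from 1 to 10, although the waiting is described only once.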
Still further, there is a case where it is desired to redo only part of a large number of submitted jobs. For example, only a set of mutually related jobs among them is to be resubmitted after part of a submission file has been corrected. In the past, it was necessary in such a case to submit only the necessary jobs manually. If the number of jobs to be submitted increases too much, however, it is difficult to submit them by manual operation, and hence the jobs should be submitted through a separate script or the like. In this case, there is a problem in that preparing, updating and managing such a script each time is troublesome.
Besides, jobs might take a very long time to complete. In this case, the software of this patent must be kept running throughout the execution of the jobs, which increases the danger of abnormal termination of the software due to causes such as hardware trouble. There is then a problem in that the effectiveness of the software of this patent is greatly impaired if there is no method of continuing the waiting after such a termination.
Additionally, in some job submitting mechanisms such as “PBS”, there is a problem in that once a job has been completed, information on the start and end of the job might not be provided, so it is impossible to know the time point at which the execution of the job was started or completed.