1. Field of the Invention
The present invention relates to a parallel computing method, a parallel computing program, and a computer for processing plural jobs in parallel by using a master-worker type computer system. More specifically, the invention relates to a parallel computing method, a parallel computing program, and a computer that enable computation to continue without conspicuous deterioration of execution performance even when some of the computers become heavily loaded or fail for some reason.
2. Related Art
Load balancing is a known method of preventing a decrease in processing speed when plural computers process jobs in parallel. In load balancing, the number of jobs assigned to a heavily loaded computer is reduced and those jobs are assigned to other computers, thereby equalizing the loads and preventing a decrease in processing speed.
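As an illustration only, the equalizing behavior described above can be sketched as a greedy assignment in which each job goes to the computer that currently carries the smallest load. The function name `assign_jobs`, the cost units, and the greedy policy itself are assumptions made for this sketch; they are not taken from any particular conventional system.

```python
import heapq


def assign_jobs(job_costs, initial_loads):
    """Greedy load balancing sketch (hypothetical helper).

    job_costs: estimated cost of each job, in arbitrary units (assumed known).
    initial_loads: load already present on each computer before assignment.
    Returns (assignment, final_loads); assignment[i] lists the job indices
    given to computer i.
    """
    # Min-heap of (current load, computer index): the least-loaded computer
    # is always at the top.
    heap = [(load, idx) for idx, load in enumerate(initial_loads)]
    heapq.heapify(heap)

    assignment = [[] for _ in initial_loads]
    for job, cost in enumerate(job_costs):
        load, idx = heapq.heappop(heap)      # least-loaded computer
        assignment[idx].append(job)
        heapq.heappush(heap, (load + cost, idx))

    final_loads = [0] * len(initial_loads)
    for load, idx in heap:
        final_loads[idx] = load
    return assignment, final_loads


# Computer 0 is already heavily loaded, so it receives no new jobs.
assignment, final_loads = assign_jobs([1, 1, 1, 1, 1, 1], [4, 0, 0])
print(assignment)
print(final_loads)
```

In this sketch the heavily loaded computer is skipped until the others catch up, which is the equalizing effect described in the text. The sketch assumes the loads are known at assignment time.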
In parallel computation using load balancing, the load on a specific computer may change dynamically, for example increase, due to an external factor during execution of a job. If the increase in load on the specific computer were known before execution of the job, its effect could be suppressed by assigning fewer jobs to that computer and assigning the jobs to other computers instead. In general, however, it is difficult to predict fluctuations in load and therefore difficult to allocate jobs properly. Consequently, when the load on a computer increases due to an external factor during parallel computation, completion of the job on that computer is delayed and, as a result, the processing speed of the whole parallel computation decreases.
Moreover, in parallel computation using load balancing, not only may the load on some of the computers become heavy, but a computer may also stop due to a failure during execution of a job.
There are roughly three methods of dealing with the case where a computer stops due to a failure.
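The delay described above can be made concrete with a small calculation. The following sketch assumes, purely for illustration, that jobs are allocated statically before execution and that the whole computation finishes only when the slowest computer finishes; the function name `completion_time` and the numbers are hypothetical.

```python
def completion_time(jobs_per_computer, seconds_per_job):
    """Time until every computer has finished its statically assigned jobs.

    The parallel computation as a whole ends only when the slowest
    computer ends, so the result is the maximum over all computers.
    """
    return max(n * t for n, t in zip(jobs_per_computer, seconds_per_job))


jobs = [10, 10, 10, 10]                        # equal allocation to 4 computers
normal = completion_time(jobs, [1, 1, 1, 1])   # every job takes 1 second
loaded = completion_time(jobs, [1, 1, 3, 1])   # computer 2 slows to 3 s per job
print(normal, loaded)
```

Under these assumptions, a threefold slowdown of a single computer triples the completion time of the whole computation, even though the other three computers sit idle after finishing their own jobs.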
In the first method, spare hardware is prepared to obtain fault tolerance. For example, a system is constructed from duplicated hardware performing the same operation and a comparator in order to deal with a failure. In this method, however, the hardware configuration becomes large in scale and the cost increases.
In the second method, middleware runs on each of the computers to provide fault tolerance. For example, the techniques of checkpointing and migration are used as basic techniques for realizing fault tolerance. Checkpointing is a technique of storing an execution image of a job at a certain point in time. Migration is a function of moving the stored execution image to another computer and re-executing the calculation from it. Both of the computers store the execution image, and a job of the computer that is down is executed by another computer, thereby preventing the parallel computation from being interrupted. It is, however, excessive to provide every computer with middleware performing such checkpointing and migration, and periodic checkpointing creates excessive overhead in execution of a job. Furthermore, in this method the job is re-executed at the time point when the failure is found, so that execution performance of the parallel process is much lower than in the case where there is no failure.
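The checkpointing and migration described above can be sketched with a toy job. In this sketch the "execution image" is simply a serialized dictionary, the failure is simulated by a parameter, and the second call to `run_job` stands in for another computer; all names (`run_job`, `fail_at`, `checkpoint_every`) are hypothetical, and real middleware would capture far more state.

```python
import pickle


def run_job(state, total_steps, checkpoint_every, fail_at=None):
    """Toy job that sums 0..total_steps-1, checkpointing periodically.

    Returns (finished, state, image), where image is the most recently
    stored execution image (serialized state).
    """
    image = pickle.dumps(state)                # image at start of this run
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            return False, None, image          # simulated failure: live state is lost
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            image = pickle.dumps(state)        # checkpointing: store the execution image
    return True, state, image


# The first computer fails at step 6; its last checkpoint was taken at step 4.
finished, _, image = run_job({"step": 0, "acc": 0}, 10, 4, fail_at=6)

# Migration: the stored image is restored elsewhere and the job is re-executed.
restored = pickle.loads(image)
resumed_from = restored["step"]
finished_again, final, _ = run_job(restored, 10, 4)
print(finished, resumed_from, final["acc"])
```

The job completes with the correct result, but the work done between the last checkpoint and the failure (steps 4 and 5 here) is executed twice, and every checkpoint adds serialization cost. Both effects correspond to the overhead noted in the text.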
In the third method, all jobs are multiplexed and executed. In this method, however, the number of jobs increases in proportion to the degree of multiplexing. For example, when each job is duplicated, the number of jobs is doubled.
As described above, it has conventionally been difficult to prevent the overall processing speed from decreasing when the load on some of the computers increases during parallel processing or when some of the computers stop.
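The cost of multiplexing every job can be illustrated with simple arithmetic. The sketch below assumes, for illustration only, that all jobs take the same time and that each computer runs one job at a time; the function names and the figures are hypothetical.

```python
import math


def total_executions(num_jobs, multiplicity):
    """Number of job executions when every job is run `multiplicity` times."""
    return num_jobs * multiplicity


def rounds_needed(num_jobs, multiplicity, num_computers):
    """Rounds of execution, assuming equal-length jobs and one job per computer per round."""
    return math.ceil(total_executions(num_jobs, multiplicity) / num_computers)


print(total_executions(100, 1), rounds_needed(100, 1, 10))   # no multiplexing
print(total_executions(100, 2), rounds_needed(100, 2, 10))   # every job duplicated
```

Under these assumptions, duplicating 100 jobs on 10 computers raises the number of executions from 100 to 200 and the number of rounds from 10 to 20, regardless of whether any failure actually occurs.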