The present invention relates to a parallel computer system and, in particular, to a parallel computer system that can continue and complete a stable parallel computing process without interruption by automatically detecting, and thereby automatically avoiding, various troubles that may occur during computation.
In recent years, the performance of computers has improved dramatically. Nevertheless, further improvement is still required in actual design work. When performing a design optimization of automobiles, aircraft or the like, several years or even several hundred years of computing time may be needed, even with the most advanced machine (e.g., a 3.6 GHz Xeon machine) and with simplified parts. For example, consider a design optimization using an evolutionary algorithm with 100 individuals and 500 generations, which are typical values in the art. If one day is needed to evaluate one individual using computational fluid dynamics, the required computing time would be 1 (day) × 100 (individuals) × 500 (generations) = 50,000 (days), or about 137 years.
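The estimate above can be reproduced with a short calculation. The following is only an illustrative sketch: the function name `sequential_compute_days` is hypothetical, and a 365-day year is assumed for the conversion to years.

```python
def sequential_compute_days(days_per_evaluation, individuals, generations):
    """Total computing time, in days, when every individual of every
    generation is evaluated one after another on a single machine."""
    return days_per_evaluation * individuals * generations


# Values taken from the example in the text.
total_days = sequential_compute_days(days_per_evaluation=1,
                                     individuals=100,
                                     generations=500)

# Assumption: a 365-day year is used for the conversion.
total_years = total_days / 365

print(total_days)          # 50000
print(round(total_years))  # 137
```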
In order to address this problem of computing time, parallel computer systems (PC clusters or the like) are often used. A parallel computer system, which comprises multiple computers interconnected over networks, divides a large-scale computation into smaller computation blocks that are processed by separate computers, thereby shortening the time required to obtain a computation result.
Although the performance and stability of parallel computer systems have improved significantly, the problem of “failure rate” still exists. The “failure rate”, which represents the probability that a trouble occurs at some point within a system, can be expressed using the notation $P_{\mathrm{broken}}$ as in the following equation:

$$P_{\mathrm{broken}} = 1 - (1 - p)^{n}$$

where “p” represents the failure rate of each constituent part of the system and “n” indicates the number of parts included in the system.
Because a parallel computer system is formed by interconnecting multiple computers (a master node and slave nodes) over networks, it entails a large number of parts, such as the cables that interconnect the computers and the networks, and its failure rate is therefore much higher than that of a standalone computer. As the parallel computer system grows and the number of constituent computers increases, the total number of parts increases as well, and accordingly the failure rate of the whole parallel computer system approaches one, a state in which at least one failure practically always exists somewhere in the system. This failure rate problem is a cause of instability in computation by the parallel computer system.
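How quickly the system failure rate approaches one can be seen by evaluating the failure rate equation for an increasing number of parts. The sketch below is illustrative only: the function name `system_failure_rate`, the per-part rate of 0.001, and the part counts are hypothetical values chosen for demonstration, not figures from any measured system. It also assumes, as the equation implicitly does, that every part fails independently with the same rate p.

```python
def system_failure_rate(p, n):
    """Probability that at least one of n parts has failed,
    i.e. P_broken = 1 - (1 - p)^n.

    Assumes each part fails independently with the same rate p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-part failure rate, for illustration only.
PART_FAILURE_RATE = 0.001

# Hypothetical part counts, from a single machine to a large cluster.
for part_count in (10, 100, 1000, 10000):
    rate = system_failure_rate(PART_FAILURE_RATE, part_count)
    print(f"n = {part_count:>5}: P_broken = {rate:.4f}")
```

Under these assumed values, the computed failure rate rises monotonically with the number of parts and is close to one for the largest part count, which matches the tendency described above.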
In order to cope with this kind of problem, some conventional approaches include a method of using a so-called checkpoint-restart function for restarting the computing process manually after repairing the parallel computer system as described in Japanese Patent Application Publications Nos. 2002-288149, H10-116261 and 2002-366535. Another conventional approach is a method of disabling computation at the computing node in which any abnormality is detected as described in JPAP Nos. 2003-203061, 2004-38654 and H6-161976.
However, when the checkpoint-restart function is utilized, the computer cannot automatically monitor for troubles or automatically recover from them. Such work must be done by a system administrator. Moreover, since processing stops once a trouble occurs, the checkpoint-restart scheme is inefficient in an environment such as design optimization using evolutionary optimization, in which a single computation may take several months or several years even with a parallel computer system.
On the other hand, according to the above-referenced method of disabling computation at the computing node in which the abnormality has occurred, some specific hardware troubles, such as a communication problem with the computing node, can be avoided to a certain extent by isolating the relevant computing node. However, the above-referenced publications (JPAP Nos. 2003-203061, 2004-38654 and H6-161976) do not describe any method for coping with other troubles, which are more likely to occur in the parallel computer system. Such other troubles include abnormality in the software aspect of the network system of the computing node, crash and/or hang-up of the computing program, overcapacity of the hard disk (HDD), abnormality in the I/O system, and abnormality in the parallel virtual machine (PVM) and the message passing interface (MPI), which are software components of the parallel computer system.
Thus, it is an objective of the present invention to provide a parallel computer system that provides a stable parallel computing process without interruption by automatically detecting various troubles that may occur during computation with the parallel computer system. The system automatically copes with the troubles in an environment such as design optimization using an evolutionary optimization scheme, in which a single computation requires a long time.