1. Field of the Invention
The present invention relates to a system for and method of collecting dump information in a data processing system in which a plurality of computation processors execute a parallel processing program, and more specifically, to a system for and method of collecting a plurality of dumps in a parallel computer system with distributed memory architecture.
2. Description of the Related Art
When a stand-alone computer system goes down, it generally collects a dump of its main storage or secondary storage and outputs the dump to external storage so that its operating system can be troubleshot.
In contrast to stand-alone systems, a parallel computer system with distributed memory architecture has a plurality of computation processors interconnected by a network. Each computation processor incorporates an independent CPU and memory and performs concurrent computation under the control of a common parallel processing program, transferring data and synchronizing with the other processors via the network. Such parallel computer systems also carry out the above-described dump collection in the case of a system failure.
Take, for instance, a conventional computer system composed of multiple computation processors, and assume that some of the computation processors are executing a common parallel processing program. If a failure is detected in one of those processors, all the processors executing the parallel processing program are aborted in the middle of their operation. Subsequently, the dumps of those computation processors are collected and output as files to be stored in an external storage unit.
Of the computation processors whose dumps have been collected, those other than the failed one are restarted after the dump collection is complete, since they are presumed to have no problem themselves.
In this situation, there is naturally a demand to minimize the system down time caused by a failure, i.e., the period of time from abort to restart.
Unfortunately, it generally takes a long time to finish writing the dumps from the computation processors into the external storage unit. Furthermore, increasing memory consumption in modern computation processors makes each dump larger, so the time necessary for writing the dumps grows longer and longer. The total system down time also increases in proportion to the number of computation processors subject to the dump collection.
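The proportionality noted above can be illustrated with a rough estimate. The following is only a sketch under stated assumptions and is not taken from the described system: it assumes that the dumps are written one after another to a single external storage unit at a constant write bandwidth, and the function name, the parameter names, and the figures in the example are all hypothetical.

```python
def serial_dump_downtime_s(num_processors: int,
                           memory_bytes_per_processor: int,
                           write_bandwidth_bytes_per_s: float) -> float:
    """Estimate the seconds needed to write every dump to external storage.

    Assumption (illustrative only): dumps are written serially to one
    storage unit whose write bandwidth is constant.
    """
    if num_processors < 0 or memory_bytes_per_processor < 0:
        raise ValueError("counts and sizes must be non-negative")
    if write_bandwidth_bytes_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    # Total data to write grows with both processor count and memory size.
    total_bytes = num_processors * memory_bytes_per_processor
    return total_bytes / write_bandwidth_bytes_per_s


if __name__ == "__main__":
    GIB = 2 ** 30
    MIB = 2 ** 20
    # Hypothetical figures: 1 GiB per processor, 10 MiB/s write bandwidth.
    for n in (4, 8, 16):
        seconds = serial_dump_downtime_s(n, 1 * GIB, 10 * MIB)
        print(f"{n} processors: about {seconds:.1f} s of dump writing")
```

Under these assumptions, doubling either the number of computation processors or the memory per processor doubles the estimated writing time, which is the growth in down time that the passage above describes.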
All of these are serious disadvantages to users, and it is therefore essential to reduce the system down time as much as possible.