1. Field of the Invention
The present invention relates to a computer system, and more particularly to a computer system in which a plurality of computer nodes are coupled to build up a single multi-processor.
2. Description of the Related Art
Recently, parallel computer systems have been put into practice in which a plurality of computer nodes capable of operating independently are coupled to build up a tightly coupled multi-processor. Such a parallel computer system has the advantages of ease of operation and management, high processing efficiency, and efficient use of resources. One problem in such a parallel computer system is the appropriate management of fault data. To operate the parallel computer system appropriately, a processor must control the parallel computer system and collectively manage the fault data of each computer node.
Japanese Laid Open Patent Application (JP-A-Heisei 8-263329) discloses a parallel computer system which manages fault data in an integrated manner by using a service processor. The conventional parallel computer system includes the service processor in addition to a master processor and slave processors. The master processor, the slave processors, and the service processor are connected with each other through a diagnosis path. The diagnosis path is used only for management of the fault data. When a fault occurs in one of the slave processors, that slave processor stores fault data (log data) internally. The service processor reads out the fault data from the slave processor in which the fault occurred through the diagnosis path and transfers the fault data to the master processor. Thus, the master processor can manage the fault data of the slave processors in an integrated manner.
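The flow of fault data in this conventional system can be illustrated by the following minimal sketch. It is not taken from the cited application; all class and method names (SlaveProcessor, ServiceProcessor, MasterProcessor, record_fault, collect, receive) are hypothetical, and the diagnosis path is modeled simply as direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveProcessor:
    """Hypothetical slave processor that keeps its fault data (log data) internally."""
    name: str
    fault_log: list = field(default_factory=list)

    def record_fault(self, description: str) -> None:
        # The fault data stays inside the slave until the service processor reads it.
        self.fault_log.append(description)


@dataclass
class MasterProcessor:
    """Hypothetical master processor that holds the fault data of all slaves."""
    collected: dict = field(default_factory=dict)  # slave name -> list of records

    def receive(self, slave_name: str, records: list) -> None:
        self.collected.setdefault(slave_name, []).extend(records)


class ServiceProcessor:
    """Hypothetical service processor; the diagnosis path is modeled as method calls."""

    def __init__(self, master: MasterProcessor, slaves: list):
        self.master = master
        self.slaves = {s.name: s for s in slaves}

    def collect(self, slave_name: str) -> list:
        # Read the log data out of the faulty slave and forward it to the master.
        records = list(self.slaves[slave_name].fault_log)
        self.master.receive(slave_name, records)
        return records


slave_a = SlaveProcessor("slave-0")
slave_b = SlaveProcessor("slave-1")
master = MasterProcessor()
service = ServiceProcessor(master, [slave_a, slave_b])

slave_b.record_fault("memory parity error")
service.collect("slave-1")
```

In this sketch the master processor never reads a slave processor directly; every transfer passes through the service processor, which is the dedicated component that this conventional arrangement depends on.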
However, providing a dedicated service processor is not preferable in view of cost. In particular, it is not preferable to provide such a service processor when computers designed in accordance with a standard PC architecture such as PC/AT (Personal Computer/Advanced Technology) are used as the computer nodes to realize a low-cost parallel computer system. In addition, it is necessary to adopt a fault data management method that overcomes the constraints imposed by the PC architecture.
When computers designed in accordance with the standard PC architecture are used as the computer nodes, the tightly coupled multi-processor configured by coupling the computer nodes must also operate in accordance with the PC architecture. One constraint in such a computer system is that only one bridge circuit (typically, a south bridge) is permitted to connect the computer node and peripheral devices. This is an important constraint on the management of the fault data. In general, the fault data of each computer node is stored in a non-volatile memory (NVRAM) managed by the south bridge of that computer node. However, after the computer nodes are coupled to each other to build up a multi-processor, the multi-processor can use only one south bridge, namely the south bridge of a selected computer node. The south bridges of the non-selected computer nodes cannot be used, and access to the NVRAMs managed by them is not permitted. This means that, after the coupling of the computer nodes, the tightly coupled multi-processor cannot refer to the fault data stored in the NVRAMs managed by the non-selected south bridges. In other words, the fault data are not inherited by the tightly coupled multi-processor after the coupling of the computer nodes. This is a problem for the proper operation of the tightly coupled multi-processor. Likewise, when the tightly coupled multi-processor is separated, the fault data generated for each computer node during operation of the tightly coupled multi-processor must be inherited by that computer node. Therefore, the succession of the fault data is also important when the tightly coupled multi-processor is separated into the computer nodes and each computer node starts operating independently.
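The constraint described above can be illustrated by the following minimal sketch, which models only the access restriction and not any actual chipset behavior. All names (SouthBridge, ComputerNode, couple, visible_fault_data) and the sample fault records are hypothetical.

```python
class SouthBridge:
    """Hypothetical south bridge that manages an NVRAM holding fault data."""

    def __init__(self, fault_records):
        self.nvram = list(fault_records)
        self.enabled = True  # every south bridge is usable while nodes run independently

    def read_fault_data(self):
        if not self.enabled:
            raise PermissionError("south bridge is not usable in the coupled system")
        return list(self.nvram)


class ComputerNode:
    """Hypothetical computer node with exactly one south bridge."""

    def __init__(self, name, fault_records=()):
        self.name = name
        self.south_bridge = SouthBridge(fault_records)


def couple(nodes, selected_index=0):
    """Couple the nodes; only the selected node's south bridge stays usable."""
    for i, node in enumerate(nodes):
        node.south_bridge.enabled = (i == selected_index)
    return nodes[selected_index]


def visible_fault_data(nodes):
    """Return the fault data that can currently be read; None marks inaccessible NVRAM."""
    visible = {}
    for node in nodes:
        try:
            visible[node.name] = node.south_bridge.read_fault_data()
        except PermissionError:
            visible[node.name] = None
    return visible


nodes = [ComputerNode("node-0", ["fan failure"]),
         ComputerNode("node-1", ["ECC error"])]

before = visible_fault_data(nodes)   # both NVRAMs are readable
couple(nodes, selected_index=0)
after = visible_fault_data(nodes)    # node-1's fault data can no longer be read
```

In this sketch the fault record of node-1 still exists in its NVRAM after the coupling, but the coupled system has no permitted way to read it, which corresponds to the failure of succession discussed above.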
For this reason, in a computer system in which a plurality of computer nodes are coupled to build up a tightly coupled multi-processor, it is required that proper management of the fault data of each computer node, e.g., proper succession of the fault data before and after the coupling and separation of the computer nodes, be realized at lower cost. In particular, it is desired to achieve such proper management while overcoming the constraints of the standard PC architecture.
In conjunction with the above description, Japanese Laid Open Patent Applications (JP-A-Heisei 11-212836, JP-A-2000-194584, JP-A-2001-109702, JP-A-2002-91938) disclose management techniques for collecting fault data or system data from a plurality of computers. However, these applications do not disclose the succession of fault data between each of the computer nodes and the tightly coupled multi-processor.
Also, a parallel computer system is disclosed in Japanese Laid Open Patent Application (JP-A-Heisei 8-6909). In this conventional parallel computer system, a single service processor controls a plurality of processors through a diagnosis path. Each of the plurality of processors has a log data acquiring section to acquire log data when a fault has occurred in the processor, and a storage circuit to store the log data. An error notifying circuit of the processor notifies the occurrence of the fault. A diagnosis path control circuit of the processor controls the diagnosis path. One of the plurality of processors, serving as a specific processor, has an interrupt analyzing section to receive the notice of the fault occurrence. In the specific processor, a log data receiving section reads out the log data from the service processor, and a log data write section stores the log data supplied from the log data receiving section. The service processor has a diagnosis path control circuit to control the diagnosis path. In the service processor, a node selecting section selects one processor which has issued a request through the diagnosis path, and a log data collecting section collects the log data from the selected processor. An error notifying section notifies a master processor of the request received from the processor.