1. Field of the Invention
The present invention relates to a computer system, and more particularly to a computer system capable of preventing a failure occurring in one of the computers constituting the computer system from propagating to the rest of the computers.
2. Description of the Related Art
A typical conventional computer system capable of preventing a failure occurring in one of its constituent computers from propagating to the rest of the computers achieves this capability as follows.
FIG. 5 is a block diagram showing a typical configuration of a conventional computer system. In the computer system shown in the diagram, a plurality of computers 100, 200 are connected with each other through a network 500, and operate in coordination with each other as a cluster system.
As shown in FIG. 5, computers 100, 200 constituting this type of computer system are referred to as “nodes.”
Nodes 100, 200 each include a plurality of CPUs 101 to 10n; a system controller 111 that is connected to each of the CPUs 101 to 10n; a main memory 112 for containing information concerning the operation of the system controller 111 and so forth; an IO controller 114 for controlling the input and output of the information processed by the system controller 111; a network adapter 115 for electrically connecting the node bodies 100, 200 and the network 500; an IO bus 113 for connecting the system controller 111, the IO controller 114, and the network adapter 115 with one another; and an inter-node connection bus 116 for physically connecting the node bodies 100, 200 and the network 500.
For this type of computer system, operational continuity is ensured by improving fault tolerance through increased system redundancy or by improving system performance through parallel job execution by two or more nodes 100, 200, so that the entire system does not go down even when one of its nodes fails.
In such a cluster system, jobs executed by the individual nodes 100, 200 are started as different processes independent of each other. As a result, when a failure occurs in one of the nodes, the failing node can be isolated from the other nodes; the job being executed by the failing node can then be re-executed or resumed by a healthy node, thereby improving the availability of the system.
In a typical conventional cluster system, the communications channel between nodes 100, 200 consists of a communication network 500, notably Ethernet (R) or Fibre Channel.
In recent years, a new type of cluster system has appeared. As shown in FIG. 6, this type of cluster system has a plurality of processors. It can achieve ultra-high-speed inter-node communication by logically dividing a medium- or large-scale distributed shared memory system into units of distributed memories and by using remote memory access for inter-node communication. The internal configuration of each node of this cluster system is similar to that of the individual nodes 100, 200 shown in FIG. 5, except that the former node uses a cross-bar switch 500′ instead of a network 500.
When used as a single computer, the distributed shared memory system shown in FIG. 6 uses the entire memory space formed by local and remote memories as a single memory space of its own. In cluster operation mode, on the other hand, each processor group uses only its local memory as its own memory; in this case, access to a remote memory serves as an inter-node access from one node to another.
With this mode of operation, a cluster system with extremely efficient inter-node access paths can be provided, because inter-node access can attain a performance level similar to that of a remote memory access in a single distributed shared system, in terms of both access time and throughput.
However, a cluster system based on the conventional art, in which a distributed shared system is divided logically, may from time to time fail to realize fully the potential high availability of the cluster system as described above. This is because the nodes in such a cluster system are connected very densely with each other; in such a dense node connection, an uncorrectable failure that has occurred between nodes during data transfer may propagate in its entirety to other nodes, possibly leading to a failure in many or all of the nodes in the system.
Japanese Patent Laying-Open (Kokai) No. 2001-7893 describes an art to resolve the problem of a failure propagating between nodes in a cluster system using a logically divided distributed shared system. This art features an enhanced ECC (Error-Correcting-Code) circuit used in the system controlling part which, in addition to functions for 1-bit error detection, 1-bit error correction, and 2-bit error detection, is provided with a capability to replace data being sent to another node with a fixed value of “0” plus ECC when a 2-bit error is detected. This art also ensures that the sum adding function of the cluster driver will always calculate a sum for data check, write the resulting sum into the shared memory of the own node, and append the sum to the data sent to another node. Finally, the sum check function of this art is designed to always check the sum for data check contained in the receive data that has been read from the shared memory of the other node.
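The interplay between the zero-replacement capability and the sum check described above can be illustrated with a minimal sketch. The cited publication is not quoted here, so everything below is an assumption made for illustration: the 32-bit word size, the simple additive sum, replacement at single-word granularity, and the hypothetical names `append_sum`, `verify_sum`, and `zero_word`.

```python
from typing import List

WORD_MASK = 0xFFFFFFFF  # assumed 32-bit words; the source does not state a width


def append_sum(payload: List[int]) -> List[int]:
    """Sender side: compute an additive sum and append it to the send data."""
    checksum = sum(payload) & WORD_MASK
    return payload + [checksum]


def verify_sum(received: List[int]) -> bool:
    """Receiver side: recompute the sum and compare it with the appended one."""
    if not received:
        return False
    *payload, checksum = received
    return (sum(payload) & WORD_MASK) == checksum


def zero_word(frame: List[int], index: int) -> List[int]:
    """Model the hardware replacing one uncorrectable word with a fixed 0."""
    damaged = list(frame)
    damaged[index] = 0
    return damaged


frame = append_sum([1, 2, 3])
print(verify_sum(frame))                # intact frame passes the check
print(verify_sum(zero_word(frame, 1)))  # zeroed word makes the check fail
```

In this sketch the receiving node sees well-formed (zeroed) data rather than an uncorrectable error, and the mismatch in the sum is what tells its driver that the transfer must be discarded or retried. A plain additive sum would not catch the case where every word, including the sum itself, is zeroed; a real design would need to account for that.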
In the art described above, a remote memory read used for data transfer between the nodes in the cluster is executed by a cluster driver program running on the target node, which issues, on the processor located in its own node, a LOAD instruction from the memory space of the source node.
In a commonly used processor, following the execution of a load instruction by the program, timer-based monitoring is conducted from the time the resulting data read is issued outside the processor as a read request until the target data is returned to the processor. If for some reason no reply data has been returned in response to the executed load instruction and the timer detects a timeout condition, this may develop into an OS panic or other serious situation, preventing further operation of the entire system.
Otherwise, if the processor does not perform timeout detection, the failure of reply data to return may cause the operation of the processor to stall.
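The timer-based monitoring of an outstanding read can be pictured with a small software analogy. This is only a sketch under stated assumptions: a real processor implements the timer in hardware, and the reply queue, the `ReadTimeout` exception, and the function `monitored_read` are hypothetical names introduced for illustration.

```python
import queue


class ReadTimeout(Exception):
    """Raised when no reply data arrives within the monitoring window."""


def monitored_read(reply_queue: "queue.Queue[int]", timeout_s: float) -> int:
    """Wait for reply data to an issued read, giving up after timeout_s."""
    try:
        # Blocks until reply data is available or the timer expires.
        return reply_queue.get(timeout=timeout_s)
    except queue.Empty:
        # In the processor described above, this condition may escalate
        # to an OS panic rather than a recoverable exception.
        raise ReadTimeout(f"no reply data within {timeout_s} s")


replies: "queue.Queue[int]" = queue.Queue()
replies.put(42)                       # the remote node answers this read
print(monitored_read(replies, 0.05))

try:
    monitored_read(replies, 0.05)     # the remote node never answers
except ReadTimeout as exc:
    print("timeout:", exc)
```

Calling `reply_queue.get()` with no timeout would correspond to the second case in the text: without timeout detection, the caller simply blocks for as long as the reply data fails to return.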
Therefore, even with the art described in the disclosure above, high availability may sometimes not be achieved. If, during an inter-node access, a remote memory read from the memory of the target node receives no reply data because of a failure on the target node or somewhere along the channel connecting the two nodes, the source node issuing the read can also be affected by the failure.
In the worst-case scenario, in which all but one node are executing remote memory reads from the memory of the one node and the one node cannot return the read reply data because of a failure, this may develop into a failure of the entire system.
For this reason, a cluster system according to this art often cannot achieve the high availability that it was originally designed to achieve.
Japanese Patent Laying-Open (Kokai) No. Heisei 8-137815 describes a computer system that is designed to prevent the occurrence of a failure while processing a message. In this computer system, the requesting module is provided with a sending part for sending a Synchronize message to the target module if a response to a message it has sent out should time out; a part for discarding any response message to a previous message that is received before a Synchronization Completed message is received; and a synchronization completing processing part for performing the process to complete synchronization upon receiving a Synchronization Completed message. The target module in this computer system is provided with a replying part for replying to the requesting module with a Synchronization Completed message upon receiving a Synchronize message.
However, all the parts described above are provided within the processor, as shown in FIG. 2, and several problems attributable to this configuration have been reported. For example, when a Synchronization Completed message arrived during an operating system's startup procedure on the processor, a fault occurred in the operating system, hampering its processing.
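The message exchange described above can be sketched as two small state machines. This is an illustrative sketch only, not the implementation of the cited publication: the class names `Requester` and `Target`, the message labels, and the return strings are all hypothetical.

```python
from typing import List, Optional


class Requester:
    """Requesting module: resynchronizes after a response timeout."""

    def __init__(self) -> None:
        self.synchronizing = False
        self.outbox: List[str] = []      # messages to send to the target
        self.delivered: List[str] = []   # responses accepted as valid

    def on_timeout(self) -> None:
        # Sending part: a response timed out, so request synchronization.
        self.synchronizing = True
        self.outbox.append("SYNCHRONIZE")

    def on_message(self, kind: str, body: Optional[str] = None) -> str:
        if kind == "SYNC_COMPLETED":
            # Synchronization completing part: resume normal operation.
            self.synchronizing = False
            return "synchronized"
        if self.synchronizing:
            # Discarding part: a stale response to a previous message.
            return "discarded"
        self.delivered.append(body or "")
        return "delivered"


class Target:
    """Target module: acknowledges a Synchronize message."""

    def on_message(self, kind: str) -> Optional[str]:
        return "SYNC_COMPLETED" if kind == "SYNCHRONIZE" else None


requester, target = Requester(), Target()
requester.on_timeout()
print(requester.on_message("RESPONSE", "late reply"))   # stale, discarded
ack = target.on_message(requester.outbox.pop(0))
print(requester.on_message(ack))                        # synchronization done
print(requester.on_message("RESPONSE", "fresh reply"))  # accepted again
```

The point of the handshake is that any response arriving between the timeout and the Synchronization Completed message is treated as belonging to the abandoned request and is dropped.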