1. Field of the Invention
The present invention relates to an error recovery technology in a parallel computer, and in particular to a memory error recovery technology in a cluster computer.
2. Description of the Related Art
One type of parallel computer is a cluster computer, in which a plurality of nodes, each of which includes at least one processor and a memory, are connected together by a high-speed interconnecting network, such as a crossbar network. One advantage of a cluster computer is its superior ratio of cost to capacity. For example, although each node adds cost, when workstations with a high throughput are used as the nodes, a ratio of cost to capacity almost as good as that of a supercomputer can be obtained. Another advantage is that the system is easily expanded in comparison to a parallel computer having central common memory, in which common memory is centrally allocated at one physical location. A further advantage is that, because each node is independent as one computer under the control of its own operating system, a multi-job processing configuration is possible, for example, executing different jobs at different nodes of the cluster computer, or executing one job at a plurality of nodes simultaneously as a parallel program. Japanese Unexamined Patent Application, First Publication, No. Hei 8-305677, is an example of a citation relating to such a cluster computer.
In addition, there are cluster computers that are distributed common memory parallel computers that allocate local memory to each node, and do not centrally allocate common memory to one physical location. However, because this is one type of common memory computer, the inter-processor communication model follows the common memory model. That is, communication between nodes is realized by the processor of each node directly accessing the common memory by an address command using a conventional memory access operation. Specifically, when a memory access request generated at one node is an access to the memory located at the same node, the memory access request is transferred to the memory of the same node, and the memory access origin is sent the access result. Otherwise, when a memory access request generated at one node is an access to memory located on another node, the memory in the other node is accessed by the memory access request being transferred to the other node through the interconnecting network, the access result being returned to the request origin node through the interconnecting network, and the memory access origin being notified.
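The local and remote access paths described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `Node`, `Interconnect`, `home_node`, and `read`, and the rule that each node owns a contiguous block of `words_per_node` addresses, are hypothetical and do not appear in the cited art.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    # Local memory only: address -> value (hypothetical representation).
    memory: dict = field(default_factory=dict)


class Interconnect:
    """Hypothetical stand-in for the interconnecting network."""

    def __init__(self):
        self.nodes = {}
        self.transfers = 0  # counts requests forwarded between nodes

    def attach(self, node):
        self.nodes[node.node_id] = node

    def forward_read(self, target_id, address):
        # Transfer the request to the other node and return its result.
        self.transfers += 1
        return self.nodes[target_id].memory[address]


def home_node(address, words_per_node):
    # Assumption: node k owns addresses [k * words_per_node, (k + 1) * words_per_node).
    return address // words_per_node


def read(origin, address, network, words_per_node):
    target = home_node(address, words_per_node)
    if target == origin.node_id:
        # Access to memory on the same node: no network transfer.
        return origin.memory[address]
    # Access to memory on another node: goes through the interconnect.
    return network.forward_read(target, address)
```

In this sketch the requesting processor issues the same `read` call in both cases, which reflects the common memory model: the location of the memory is resolved from the address rather than by an explicit message.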
The memory within the nodes that configure the cluster computer stores important information that must not be damaged, such as the operating system and the various application programs that are executed by the node. Thus, memory having an internal ECC (Error Checking and Correction) function is used to increase reliability. For example, a 1-bit error can be corrected with a Hamming code that adds 7 correction bits to 32 bits.
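As a check on the figure of 7 correction bits, the following sketch computes the number of check bits from the standard Hamming condition 2^r >= m + r + 1 for m data bits. The text does not state which code variant is meant. The assumption here is that the 7 bits correspond to a single-error-correcting, double-error-detecting (SEC-DED) code, that is, 6 Hamming check bits plus one overall parity bit. The function names are illustrative only.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits plus one overall parity bit for double-error detection."""
    return hamming_check_bits(data_bits) + 1


if __name__ == "__main__":
    print(hamming_check_bits(32))  # 6
    print(secded_check_bits(32))   # 7
```

Under this assumption, a 1-bit error is corrected and a 2-bit error is detected but not corrected, which matches the behavior described for memory with this kind of internal error correction function.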
When a node carries out a memory access to memory with this kind of internal error correction function, a 1-bit error is automatically corrected, and the memory access ends normally. However, a 2-bit error is not correctable, so the memory access ends abnormally, and an irrecoverable abnormal stop is returned as the memory access result. Because a hardware fault that causes an irrecoverable error in the main memory of the computer constitutes a very serious error, in conventional cluster computers, as in general-use computers, a system shutdown error notification is issued in the node that receives an irrecoverable abnormal stop as a memory access result, all programs being executed at that node are ended, and the system is stopped.
Therefore, when an irrecoverable error occurs in a common communication area that is located in the memory of each node for communication between nodes, a node that accesses this common communication area shuts down its system even if the accessed memory is located on another node. An original feature of cluster computers is that each node can operate independently. Nevertheless, when an irrecoverable error occurs in memory not located on the same node, a node shuts down the system merely by accessing that location, and this situation is a factor that severely decreases the availability of the cluster computer.
Thus, it is an object of the present invention to stop a node that has accessed the common communication area from shutting down the system due to an irrecoverable error produced in the common communication area of memory located on another node, and increase the availability of a cluster computer.
In addition, when an irrecoverable error occurs in a node's privileged area, which stores information necessary for the node to continue operating, such as the kernel of the operating system, it is inevitable that the node shuts down the system. However, a node also immediately shuts down the system when the irrecoverable error occurs in a common communication area located on that node, and this situation is a major factor in decreasing the availability of the cluster computer.
Thus, a second object of the present invention is to prevent one node from shutting down the system due to an irrecoverable error occurring in the common communication area of the memory located on that same node, and increase the availability of the cluster computer.
In order to achieve the first object of the present invention, each node in the cluster computer of the present invention sends to the memory access origin a system error stop notification when an irrecoverable error occurs at the time a memory access request in one node is sent to that same node's privileged area, and sends to the memory access origin a common communication area error notification when the irrecoverable error occurs at the time a memory access request is sent from one node to a common communication area of memory located on another node through the interconnection network.
In this manner, two notifications are defined for reporting that an irrecoverable error has occurred during a memory access: the conventional system error stop notification, which indicates that the system will be immediately stopped because a fatal error has occurred, and a common communication area error notification, which indicates that a minor error not connected with a system stop has occurred. When an irrecoverable error occurs during a memory access request generated in a node, if the access destination is that same node's privileged area, a system error stop notification indicating that a severe error has occurred is generated. However, if the access destination is the common communication area of the memory located on another node, a common communication area error notification indicating that a minor error has occurred is sent rather than the system error stop notification. Thereby, it is possible to prevent a node that accesses the common communication area of memory located on another node from shutting down the system due to an irrecoverable error occurring in that common communication area, and it is possible to increase the availability of the cluster computer.
When an irrecoverable error occurs in the memory of another node in response to a memory access request sent from a given node, a common communication area error notification is ultimately sent to the memory access origin, such as the processor of the node that is the request origin. Any of the following methods may be used to determine where this common communication area error notification is generated.
In one method, each node has a system control device that, when a memory access request generated in the node is an access request to the memory of another node, transfers the request to the other node through the interconnection network. When this system control device receives an irrecoverable abnormal stop from the other node through the interconnection network in response to the memory access request, it generates a common communication area error notification and notifies the memory access origin.
In another method, the interconnection network, when requested to transfer an irrecoverable abnormal stop in response to the memory access request, generates a common communication area error notification instead and sends it to the node that is the transfer destination, that is, the node that was the access request origin.
In another method, the notification is generated in the node whose memory was actually accessed. That is, the system control device of one node passes a memory access request sent from another node through the interconnection network to the memory of the one node, and returns the access result to the access origin node through the interconnection network. When this system control device receives an irrecoverable abnormal stop from the memory of the one node in response to a memory access request from the other node, it issues a common communication area error notification instead and returns it to the access origin through the interconnection network.
Another method is to generate the notification in a service processor that receives an error report from each node and makes an error log. That is, when the service processor receives an irrecoverable memory error report from a node, it determines whether or not the error occurred in the common communication area, and if it did, sends a common communication area error notification to the node that is the access origin.
In addition, in order to achieve the second object described above, each node in the cluster computer of the present invention sends to the memory access origin a common communication area error notification when an irrecoverable error occurs at the time a memory access request generated in one node is sent to the common communication area of the memory of the same node.
In this manner, when an irrecoverable error occurs at the time a memory access request in one node is sent to the common communication area of the memory of the same node, a common communication area error notification indicating a minor error occurred is sent rather than a system error stop notification, and thereby it is possible to prevent that node from shutting down the system due to an irrecoverable error in the common communication area of the memory located therein, and thus it is possible to increase the availability of the cluster computer.
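The choice between the two notifications can be sketched as follows. This is a minimal illustration, not the implementation of the system control device: the enum names and the function `notification_for` are hypothetical, and the sketch assumes the accessed area has already been classified as either a privileged area or a common communication area.

```python
from enum import Enum


class Area(Enum):
    PRIVILEGED = "privileged area"
    COMMON_COMMUNICATION = "common communication area"


class Notification(Enum):
    SYSTEM_ERROR_STOP = "system error stop notification"
    COMMON_AREA_ERROR = "common communication area error notification"


def notification_for(area: Area) -> Notification:
    """Notification returned to the memory access origin on an irrecoverable error.

    Assumption: only the kind of area matters. A common communication area
    gives the minor notification whether it is on the same node or on
    another node, as stated for the first and second objects.
    """
    if area is Area.COMMON_COMMUNICATION:
        return Notification.COMMON_AREA_ERROR
    return Notification.SYSTEM_ERROR_STOP


if __name__ == "__main__":
    print(notification_for(Area.PRIVILEGED).value)
    print(notification_for(Area.COMMON_COMMUNICATION).value)
```

Only the system error stop notification leads to a shutdown. The common communication area error notification leaves the node running so that it can carry out lighter recovery processing.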
The processing at the time of the common communication area error notification is carried out within a range that does not involve a shutdown because the error is minor. For example, if the common communication area is partitioned into a plurality of buffers and a buffer unit can be degenerated when it is damaged, processing to degenerate the buffer of the common communication area where the error occurred is carried out. If the configuration is such that a buffer unit cannot be degenerated, or if degeneration is possible but no normal buffer remains, processing to close the communication between nodes using the relevant common communication area is carried out. Communication between the nodes that communicate through this common communication area then becomes impossible, but this in itself is not fatal to the operation of the cluster computer. The reason is that each node of the cluster computer can operate as one computer, so each node can continue executing as long as its job does not require communication with other nodes. Even when one job is being executed as a parallel program simultaneously by a plurality of nodes, the parallel program can be executed at the remaining nodes while excluding the node that can no longer communicate. Furthermore, in a cluster computer that also supports communication between nodes by the message exchange model through a global network such as Ethernet, communication between nodes according to the common memory model through the interconnection network can be replaced by communication using the message exchange model.
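The recovery processing described above can be sketched as follows. The class `CommonCommunicationArea`, its fields, and the returned strings are hypothetical names chosen for illustration. The sketch assumes buffers are identified by an index and that closing communication is modeled by a single flag.

```python
class CommonCommunicationArea:
    """Hypothetical model of a common communication area split into buffers."""

    def __init__(self, buffer_count: int, can_degenerate: bool = True):
        self.normal_buffers = set(range(buffer_count))
        self.can_degenerate = can_degenerate
        self.open = True  # communication through this area is available

    def handle_error(self, buffer_index: int) -> str:
        """Process a common communication area error without a shutdown."""
        if self.can_degenerate:
            # Take the damaged buffer out of service.
            self.normal_buffers.discard(buffer_index)
            if self.normal_buffers:
                return "degenerated"
        # Buffers cannot be degenerated, or no normal buffer remains:
        # close communication through this area; the node keeps running.
        self.open = False
        return "closed"


if __name__ == "__main__":
    area = CommonCommunicationArea(buffer_count=2)
    print(area.handle_error(0))  # degenerated
    print(area.handle_error(1))  # closed
```

In both outcomes the node itself continues to operate. A caller could, for example, switch to the message exchange model over a global network once `open` becomes false, where that model is supported.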
Because the common communication area is logically common to the nodes, when an irrecoverable error occurs in a given common communication area, not only the access origin node but also the other nodes must learn of this fact and take steps, for example, to degenerate buffer units. In the typical method of informing the other nodes about the error in the common communication area, the access origin node transfers the common communication area error notification to the other nodes through the interconnection network or the global network. Alternatively, when an irrecoverable error occurs during a memory access by another node, the following efficient methods can be used.
In one method, when an irrecoverable abnormal stop occurs in the memory of one node during a memory access request from another node, the irrecoverable abnormal stop is sent to the access request origin node through the interconnection network, and at the same time, the processor of that same node is also notified.
In another method, the interconnection network, when requested to transfer the irrecoverable abnormal stop in response to the memory access request, broadcasts the irrecoverable abnormal stop to all nodes.
In another method, the interconnection network, when requested to transfer the irrecoverable abnormal stop in response to the memory access request, broadcasts a common communication area error notification instead to all nodes, including the node that was the transfer origin.
In yet another method, when an irrecoverable abnormal stop occurs in the memory of one node at the time of a memory access request from another node, and a common communication area error notification is sent to the access origin node through the interconnection network instead of the irrecoverable abnormal stop, the interconnection network broadcasts the common communication area error notification to all nodes.
In another method, the service processor sends a common communication area error notification to all nodes, including the node of the access origin.
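As an illustration of the method in which the interconnection network substitutes the notification and broadcasts it, the following sketch models the network replacing an irrecoverable abnormal stop with a common communication area error notification delivered to every node. The classes `Node` and `Interconnect`, the `inbox` list, and the string constants are hypothetical. Real hardware would signal these events rather than pass strings.

```python
IRRECOVERABLE_ABNORMAL_STOP = "irrecoverable abnormal stop"
COMMON_AREA_ERROR = "common communication area error notification"


class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.inbox = []  # results and notifications received by this node

    def receive(self, message):
        self.inbox.append(message)


class Interconnect:
    """Hypothetical interconnection network with substitute-and-broadcast."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def return_result(self, origin: Node, result):
        if result == IRRECOVERABLE_ABNORMAL_STOP:
            # Replace the abnormal stop and inform every node, including
            # the node whose memory reported the error.
            for node in self.nodes:
                node.receive(COMMON_AREA_ERROR)
        else:
            # Normal access results go only to the access origin.
            origin.receive(result)
```

With this arrangement the access origin node does not need to forward the notification itself, because every node receives it in the same transfer.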