In a multiple node computer, such as BlueGene/L, the ability to isolate faulty hardware is essential. A chip may be operating near the edge of its acceptable environmental limits, such as clock frequency, voltage or temperature. A temporary change in one of these environmental factors may cause the node to compute an incorrect value, for instance when performing a floating point operation. Such an incorrect value is an undetected error, or fault, and can invalidate the results of the entire operation. In many cases the error is transient and does not repeat when the calculation is re-run, which makes such errors extremely difficult to find. Further, a bad calculation on one node can quickly propagate to other nodes in a massively parallel computer, for example through message passing, masking the original source of the error and making the faulty node extremely difficult to identify.
Diagnostic hardware tests can be run frequently to detect such faults by comparing computed results to known correct values; however, they may stress the hardware in different ways than real applications do. Further, diagnostic hardware tests cannot easily find and isolate a transient error, and they may not be able to find the source of a propagating error.
Checksums are routinely used for fault identification in message transmission, for example in communication protocols such as TCP, in which transmitted data is checksummed. In a typical scheme, the sender transmits the checksum along with the message, often at the end of the message. The receiver computes the checksum as the message arrives and compares its computed value to the value transmitted by the sender. If the two values differ, the message is known to be in error and can be retransmitted. However, this identifies only faulty message transmission; it does not identify whether bad data was sent as part of the message due to a faulty computation.
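The limitation described above can be illustrated with a minimal sketch. It assumes a trailing checksum computed with CRC-32 from Python's `zlib` module, chosen only for convenience; TCP itself uses a different checksum algorithm, and the function names `send` and `receive` are hypothetical. The sketch shows that a corrupted transmission is detected, while a wrong value produced by a faulty computation before sending passes the check unnoticed.

```python
import zlib


def send(packets):
    """Hypothetical sender: return the packets and a trailing checksum."""
    checksum = 0
    for packet in packets:
        # zlib.crc32 accepts a running value, so the checksum can be
        # accumulated packet by packet.
        checksum = zlib.crc32(packet, checksum)
    return list(packets), checksum


def receive(packets, sent_checksum):
    """Hypothetical receiver: checksum packets as they arrive, then compare."""
    checksum = 0
    for packet in packets:
        checksum = zlib.crc32(packet, checksum)
    return checksum == sent_checksum


# Case 1: clean transmission of a correct result.
packets, checksum = send([b"result=", b"42"])
ok_clean = receive(packets, checksum)

# Case 2: the second packet is corrupted in transit; the check fails.
ok_corrupt = receive([packets[0], b"43"], checksum)

# Case 3: the node computed a wrong value (41) before sending. The
# checksum covers the wrong data, so the receiver sees nothing amiss.
bad_packets, bad_checksum = send([b"result=", b"41"])
ok_bad = receive(bad_packets, bad_checksum)
```

In the third case the receiver accepts the message even though its payload is wrong, which is the gap that a transmission checksum alone cannot close.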
Triple modular redundancy (TMR) uses extra hardware and comparators to compare the results of the same computation performed by redundant hardware components. A voting mechanism is used to determine which of the components are correct and to isolate faulty components. However, this is a much more costly solution, in terms of hardware, than injection checksums.
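The voting step of TMR can be sketched in software as follows. This is an illustration only, under the assumption that the three redundant results are available as comparable values; in practice the comparison is done by hardware comparators, and the function name `tmr_vote` is hypothetical.

```python
from collections import Counter


def tmr_vote(results):
    """Majority-vote over three redundant results.

    Returns (majority_value, indices_of_disagreeing_components).
    Raises RuntimeError if no two components agree.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three components disagree")
    # Any component that disagrees with the majority is flagged as faulty.
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty


# Component 2 produced a different result and is isolated by the vote.
value, faulty = tmr_vote([7.5, 7.5, 7.25])
```

The sketch also makes the cost visible: every computation is performed three times so that one faulty result can be outvoted.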
Thus, those skilled in the art desire methods and apparatus for identifying and isolating node faults in multiple node computing systems, in particular node faults which may be of a transient, or non-repeating, nature. In contrast to methods of the prior art that use fault diagnostic programs not operable during execution of actual application programs, those skilled in the art desire fault detection methods and apparatus that operate during execution of application programs. In such methods and apparatus there would be no question as to whether a fault diagnostic program would successfully identify a node likely to fail during execution of an application program, since the methods and apparatus of such a system would perform fault identification using actual runs of the application program. Thus, fault conditions created by combinations of factors only encountered during execution of an application program would be detected.
In addition, those skilled in the art desire methods and apparatus for identifying and isolating faulty nodes in multiple node computing systems that can trace the initial fault condition to the node or nodes which generated it. Often, methods and apparatus of the prior art do not take the architecture of a multiple node computing system into consideration and are, therefore, incapable of identifying with particularity which node or nodes of the system originated the fault condition.
Further, those skilled in the art desire methods and apparatus for identifying and isolating faulty nodes in multiple node computing systems that are capable of identifying which portion of an application program resulted in a fault condition when executed. Methods and apparatus incapable of making such identification are less useful as diagnostic tools.
Finally, those skilled in the art desire methods and apparatus for identifying and isolating faulty nodes in multiple node computing systems that are flexible, inexpensive, and adaptable for use in combination with many different application programs. Ideally, the methods would be of such universal applicability and ease of use that they could be applied during creation of application programs. Such methods and apparatus would not require the creation of separate fault detection routines in a costly separate software authoring step. Rather, the fault detection steps could be incorporated into the application program itself.