This invention relates to the detection of failures, such as communication failures, in a fault-tolerant computing system.
Redundant hardware elements are commonly used in fault-tolerant computing systems. Individual elements of the system typically attempt to detect faults by monitoring signals generated by other elements in the system or generated externally to the system.
In addition, an element of the system may periodically transmit a so-called "heartbeat" signal that indicates proper operation of the element. If the heartbeat signal is not received by another element in the system, the receiving element can suspect that the transmitting element is not operational. However, failure to receive a heartbeat signal may also result from a fault in the communication path between the two elements. In general, fault handling should distinguish between a fault in an element of the system and a fault in the communication path between elements.
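By way of illustration only, heartbeat monitoring of the kind described above may be sketched as follows. The class name `HeartbeatMonitor`, the element names, and the 3-second timeout are assumptions made for this example and are not taken from any particular system. The sketch reports a peer as suspected once no heartbeat has been recorded within the timeout; it does not, by itself, determine whether the peer or the communication path is at fault.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the time of the last heartbeat received from each peer element."""

    timeout: float  # seconds of silence before a peer is suspected (assumed value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def suspects(self, now: float) -> List[str]:
        """Return peers whose last heartbeat is older than the timeout.

        A suspected peer may have failed, or the path to it may have failed;
        a missed heartbeat alone cannot tell the two cases apart.
        """
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Hypothetical usage with explicit timestamps so the behavior is repeatable.
monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("element_a", now=0.0)
monitor.record("element_b", now=0.0)
monitor.record("element_a", now=2.5)   # element_a keeps sending
late = monitor.suspects(now=4.0)       # element_b has been silent for 4.0 s
```

Timestamps are passed in explicitly rather than read from a clock, so that the same sequence of events always yields the same set of suspected peers.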
Redundant network interface controllers (NICs) are used in fault-tolerant computing systems to provide reliable, uninterrupted communication with an external network. In general, one NIC operates in a primary, or active, mode in which the NIC is responsible for communication with other devices on the network, while the other NIC operates in a standby mode.
In operation, the NICs can exchange heartbeat messages to detect failures in a path from one NIC through the external network and back to the other NIC. A failure in the path between NICs can occur at several points, including the input or output stages of the NICs, the transmitting or receiving connections between the NICs and the external network, or the external network itself. The point of connection to the external network is generally at a port of a network hub, with the hub being connected to multiple network devices. Each NIC may be connected to a different hub in the external network to avoid having a single hub become a critical point of failure.
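As a simplified illustration of why a lost heartbeat does not by itself locate the fault, the path between two NICs may be modeled as an ordered list of segments, each of which must be operational for a heartbeat message to arrive. The segment names and the two-hub topology below are hypothetical and chosen only to mirror the failure points listed above.

```python
from typing import Dict, List

# Hypothetical segments on the path from NIC A, through the external
# network, to NIC B, with each NIC attached to its own hub.
PATH_A_TO_B: List[str] = [
    "nic_a_output",
    "link_nic_a_to_hub_a",
    "hub_a",
    "external_network",
    "hub_b",
    "link_hub_b_to_nic_b",
    "nic_b_input",
]


def heartbeat_delivered(path: List[str], health: Dict[str, bool]) -> bool:
    """A heartbeat arrives only if every segment on the path is operational."""
    return all(health.get(segment, True) for segment in path)


def failed_segments(path: List[str], health: Dict[str, bool]) -> List[str]:
    """List the failed segments; the receiving NIC cannot observe this directly."""
    return [segment for segment in path if not health.get(segment, True)]


all_up = {segment: True for segment in PATH_A_TO_B}
hub_down = {**all_up, "hub_a": False}
nic_down = {**all_up, "nic_b_input": False}

delivered_ok = heartbeat_delivered(PATH_A_TO_B, all_up)
delivered_hub_down = heartbeat_delivered(PATH_A_TO_B, hub_down)
delivered_nic_down = heartbeat_delivered(PATH_A_TO_B, nic_down)
```

In this model a failed hub and a failed NIC input stage both produce the same observation at the receiver, namely a missing heartbeat, although the underlying faults are in different places.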