1. Technical Field
This invention relates to a method and system for resolving a fault in a cluster of multi-homed nodes in a computer, storage or communication system. More specifically, the invention relates to detecting and isolating the fault to determine its origin, so that appropriate failover and repair action can be taken.
2. Description of the Prior Art
A node is a computer running single or multiple operating system instances. Each node in a computing environment has a network interface that enables the node to communicate in a local area network. A cluster is a set of one or more nodes coordinating access to a set of shared storage subsystems typically through a storage area network. It is common for a group of nodes to be in communication with a gateway for connection of a local area network to another local area network, a wider intranet, or a global area network. Each network interface and each gateway in a local area network includes an identifying IP address.
It is also known in the art for nodes in a local or wide area network to include two network interfaces; such nodes are known as “multi-homed nodes”. The two-interface configuration provides redundant connectivity. Multi-homed nodes possess software that has access to both network interfaces. In the event of a failure associated with one of the network interfaces or with the network path served by that interface, communication may switch to the second network interface on the same node, i.e. fail over, without interruption or loss of data or of service from the node.
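The failover behavior described above can be illustrated with a minimal sketch. It is not taken from any particular product: the function name `select_interface`, the interface names `eth0` and `eth1`, and the health-check callback are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Sequence


def select_interface(
    interfaces: Sequence[str],
    is_up: Callable[[str], bool],
) -> Optional[str]:
    """Return the first interface reported healthy, or None if all have failed.

    Hypothetical helper: `interfaces` is listed in order of preference and
    `is_up` stands in for whatever health check the node software uses.
    """
    for name in interfaces:
        if is_up(name):
            return name
    return None


# Assumed example state: the primary interface has failed, the secondary is healthy.
status = {"eth0": False, "eth1": True}
active = select_interface(["eth0", "eth1"], lambda name: status[name])
print(active)  # eth1
```

The sketch shows only the selection step. It assumes the health check is accurate, which is exactly what the fault resolution techniques discussed below attempt to establish.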
One approach to resolving faults in a network is to require a peer node to issue, or to request, a response protocol message on a suspect network interface. This approach attempts to validate the loss of a network path and to determine whether the fault is associated with a local or a remote network interface. However, the technique relies on a potentially unreliable server on a remote node to issue a ping to the local network interface, and it functions only under a single-fault scenario. Any network fault or software fault affecting the remote node will lead the local node to a false conclusion.
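The single-fault limitation can be made concrete with a small model. The sketch below is a simplified simulation, not an implementation of any actual protocol: the `Scenario` type and the functions `peer_probe_reply_seen` and `diagnose` are hypothetical names, and the model assumes the local node bases its conclusion solely on whether the peer-initiated probe is answered.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    # Hypothetical model of the two things that can fail.
    local_interface_ok: bool  # true state of the suspect local interface
    peer_ok: bool             # whether the remote peer (node, software, path) is healthy


def peer_probe_reply_seen(s: Scenario) -> bool:
    """A probe exchange succeeds only if the peer can send the probe
    and the local interface can receive and answer it."""
    return s.peer_ok and s.local_interface_ok


def diagnose(s: Scenario) -> str:
    """Conclusion the local node draws from the peer-assisted probe alone."""
    if peer_probe_reply_seen(s):
        return "local interface healthy"
    return "local interface faulty"


# Single fault on the local interface: the conclusion is correct.
print(diagnose(Scenario(local_interface_ok=False, peer_ok=True)))

# Fault on the remote peer only: the local interface is blamed incorrectly.
print(diagnose(Scenario(local_interface_ok=True, peer_ok=False)))
```

Both calls print "local interface faulty", although only the first scenario actually involves a local fault. In this model the local node cannot distinguish its own interface failure from a failure of the peer it depends on, which is the false conclusion described above.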
Other solutions include redundant heartbeats and heartbeat channels between nodes, and the use of link failures to resolve network faults. The use of redundant heartbeats and heartbeat channels solves the problem of reliably detecting a node loss, but fails to resolve a network loss. Similarly, the link failure technique is limited to providing network failover support; it does not function within an integrated high availability architecture that combines node and network monitoring with node and network path failover support. In addition, the link failure technique cannot determine whether a network partition has occurred for which the failover requires cluster reformation. Finally, such solutions, which are typically provided by network drivers, function only in a single subnet network topology.
The prior art methods for detecting and resolving a fault are either inefficient or unreliable in an integrated high availability architecture, or cannot work reliably in a two-node cluster. Accordingly, a method and system for reliable and efficient detection and resolution of a fault in an integrated high availability architecture is desired.