In a high-availability cluster, it is very important to be able to distinguish between a remote node which has failed and a network partitioning event since these events are handled in very different ways. Unfortunately, a network partition presents the same set of symptoms to system monitoring routines as does a remote node failure. In particular, while it is safe to take over the resources from a failed node, it is not safe to do so in the case of a node which appears to have failed because of a network partition.
Before discussing the problems solved by the present invention, it is worth describing the context and environment in which the invention is most advantageously employed. In particular, the present invention operates in a multinode data processing system. An exemplar system of this kind is found in the pSeries of data processing machines manufactured and sold by International Business Machines Corporation, the assignee of the present invention. Exemplary systems comprise a plurality of independent data processing nodes, each having its own independent memory and its own operating system software (though it is noted that in typical configurations of nodes whose efforts are applied to a particular problem, it is more often the case that the same operating system is employed on all of the assigned nodes). These systems operate in both parallel and distributed fashions to accomplish well-defined, user-specified data processing goals. Exemplar machines of this type typically communicate through the exchange of messages, perhaps through a publicly defined protocol such as the Message Passing Interface (MPI).
In order to solve specific problems or to provide higher availability, the nodes are organizable into clusters of select nodes. This organization can be orchestrated in the pSeries of machines through a program product called HACMP (High Availability Cluster Multiprocessing). HACMP provides the ability for applications to continue to run even in the presence of node, network and/or adapter failure. It is also noted that some or all of the nodes present may, from time to time, join and depart from various node groupings. In the event of node failure, HACMP provides a mechanism by which the work of a failed node may be picked up by other nodes. It is important that the user's application work proceed, especially in those cases where the user's application consumes lengthy amounts of data processing time and/or in those cases in which restarting is impractical.
Accordingly, group membership is fluid. This fluidity allows data processing work to continue independently of the actual node on which the work is being done. It is therefore seen that fluidity in group membership provides a powerful advantage in distributed and parallel processing environments; however, as discussed above, a network partition can also make it appear to the monitoring system (HACMP, for example) that a node failure has occurred. The problem then becomes how to distinguish such an event, which may be caused by a network failure or by a failure in all of the node's network adapters, from an actual node failure.
Several techniques are usable to help distinguish between these two cases. In particular, node liveness checks may be carried out using a shared disk path, which provides a backup communication path for the case where the main communication networks fail. Main communication paths include redundant IP-based (Internet Protocol) networks and also non-IP networks such as RS-232 ports and lines.
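By way of illustration only, the distinction drawn above can be sketched as a small decision function. The names `PeerState` and `classify_peer` are hypothetical and do not appear in HACMP or in any other product; the sketch assumes that the monitoring routine already knows whether a recent liveness indication arrived over the main networks and whether one arrived over the shared disk path.

```python
from enum import Enum


class PeerState(Enum):
    """Hypothetical labels for what a monitor may conclude about a remote node."""
    ALIVE = "alive"              # reachable over the main networks
    PARTITIONED = "partitioned"  # networks silent, shared disk path still active
    FAILED = "failed"            # silent on every path


def classify_peer(network_seen: bool, disk_seen: bool) -> PeerState:
    """Classify a remote node from two liveness indications.

    network_seen: a liveness message arrived recently over an IP or non-IP network.
    disk_seen:    a liveness indication arrived recently over the shared disk path.
    """
    if network_seen:
        return PeerState.ALIVE
    if disk_seen:
        # The node is still running, so taking over its resources is not safe.
        return PeerState.PARTITIONED
    # No path shows activity, so the node is presumed to have failed.
    return PeerState.FAILED
```

Only the `FAILED` outcome would justify taking over the remote node's resources; without the disk indication, the last two cases would be indistinguishable to the monitor.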
The present invention employs a shared disk as a liveness determination medium. Techniques are introduced to make a shared disk behave as a communications network from the point of view of a liveness determination subsystem in a distributed cluster. Other cluster subsystems use a shared disk as a “quorum device” or as a backup communication substrate, but none of them uses the disk as part of a “heartbeating network.” Using the disk as a quorum device works, but it forces both sides of the “partition” to go through their recovery procedures, whereas “disk heartbeating” allows the two sides of the partition to keep working as before, since each side knows that the other side is alive.
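The general idea of disk heartbeating can be illustrated with a minimal sketch. This is not the mechanism claimed herein; it is one simple way such a scheme might be arranged, under stated assumptions: an ordinary file stands in for the shared disk, each node owns one 512-byte sector, and a node is considered alive if the sequence counter in its sector has changed since the last check. The names `DiskHeartbeat`, `make_shared_disk`, `beat` and `peer_alive` are hypothetical.

```python
import os
import struct

SECTOR_SIZE = 512              # assumed sector size; one sector per node
RECORD = struct.Struct("<Q")   # 8-byte little-endian sequence counter


def make_shared_disk(path: str, slots: int = 2) -> None:
    """Create a zero-filled file that stands in for the shared disk."""
    with open(path, "wb") as f:
        f.write(b"\0" * SECTOR_SIZE * slots)


class DiskHeartbeat:
    """Writes this node's counter to its own sector and watches the peer's sector."""

    def __init__(self, path: str, my_slot: int, peer_slot: int) -> None:
        self.path = path
        self.my_slot = my_slot
        self.peer_slot = peer_slot
        self.seq = 0
        # Record the peer's current counter so that stale data is not mistaken for life.
        self.last_peer_seq = self._read(peer_slot)

    def _read(self, slot: int) -> int:
        with open(self.path, "rb") as f:
            f.seek(slot * SECTOR_SIZE)
            return RECORD.unpack(f.read(RECORD.size))[0]

    def beat(self) -> None:
        """Advance this node's counter and force it out to the disk."""
        self.seq += 1
        with open(self.path, "r+b") as f:
            f.seek(self.my_slot * SECTOR_SIZE)
            f.write(RECORD.pack(self.seq))
            f.flush()
            os.fsync(f.fileno())

    def peer_alive(self) -> bool:
        """Return True if the peer's counter changed since the previous check."""
        seq = self._read(self.peer_slot)
        alive = seq != self.last_peer_seq
        self.last_peer_seq = seq
        return alive
```

In such an arrangement each node would call `beat()` periodically and `peer_alive()` at a somewhat longer interval. Giving each node its own sector avoids write contention on the disk, since no two nodes ever write to the same location.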