A cluster is a set of interconnected computer system servers arranged as nodes that provide access to resources such as application programs. One reason to have a server cluster is that multiple linked computer systems significantly improve computing availability. For example, if one node fails, its resources may fail over to the surviving nodes, where, in general, failover means that the surviving nodes provide services corresponding to those previously provided by the now-failed node.
Existing communications protocols used in clusters, such as the global update protocol (GLUP), allow all healthy nodes to process a set of messages in a predictable order (such as FIFO) despite node or communication failures. In general, GLUP can be utilized to replicate data to the nodes of a cluster; a more detailed discussion of GLUP is provided in the publication entitled “Tandem Systems Review,” Volume 1, Number 2, June 1985, pp. 74-84. Problems arise, however, when failures occur and trigger recovery actions and the like before the messages are processed. For example, consider a node that sends a message to a locker node to request a lock, receives the lock in response to the request, multicasts the request to the other nodes, receives proper responses, and then sends a message releasing the lock. The node then fails. In certain situations, the locker node can detect the failure before processing the message that released the lock. Because the locker node can only assume that the failed node had not released the lock, and cannot allow a failed node to hold the lock indefinitely, the locker node frees the lock on the assumption that the node is dead. If the locker node then gives out the lock again (possibly to the same node after it has been restarted) and later processes the delayed release message, which may appear at an unpredictable time, the lock is improperly released, potentially causing significant problems in the cluster.
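The lock scenario above can be illustrated with a minimal sketch. It is not an implementation of GLUP or of any actual cluster service; the `LockerNode` class, its methods, the message tuple format, and the node name `"A"` are all hypothetical names chosen for illustration. The sketch assumes a single lock and a locker node that applies a release message whenever the sender matches the current holder.

```python
from collections import deque


class LockerNode:
    """Hypothetical locker node that hands out a single cluster-wide lock."""

    def __init__(self):
        self.holder = None    # id of the node currently holding the lock
        self.inbox = deque()  # messages received but not yet processed

    def grant(self, node_id):
        """Give the lock to node_id if it is free; return True on success."""
        if self.holder is not None:
            return False
        self.holder = node_id
        return True

    def on_node_failure(self, node_id):
        # The locker cannot know whether the failed node released the lock,
        # so it frees the lock on that node's behalf.
        if self.holder == node_id:
            self.holder = None

    def process_next(self):
        kind, sender = self.inbox.popleft()
        # The sender check cannot tell a restarted node apart from its
        # earlier incarnation, so a stale release is applied anyway.
        if kind == "release" and self.holder == sender:
            self.holder = None


def run_scenario():
    """Replay the race and return the lock holder after each step."""
    locker = LockerNode()
    history = []

    locker.grant("A")                      # node A acquires the lock
    history.append(locker.holder)

    locker.inbox.append(("release", "A"))  # A's release is queued but delayed
    locker.on_node_failure("A")            # A's failure is detected first
    history.append(locker.holder)

    locker.grant("A")                      # restarted A acquires the lock again
    history.append(locker.holder)

    locker.process_next()                  # stale release is finally processed
    history.append(locker.holder)
    return history


if __name__ == "__main__":
    print(run_scenario())
```

After the final step the locker node records no holder, although the restarted node still believes it holds the lock, which is the improper release described above.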
Similar problems arise with resources running on nodes, where duplicate failure notifications can occur. A resource may fail and be restarted before the second failure notification is processed; when that delayed notification is finally processed, it triggers another restart even though the resource is working properly. What is needed is a way for cluster software to distinguish between delayed messages that no longer have meaning (i.e., they are stale) and relevant, current messages that need to be acted upon.
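The duplicate-notification problem can be sketched in the same spirit. The `ResourceMonitor` class and `duplicate_failure_scenario` function are hypothetical and assume a monitor that restarts its resource on every failure notification it processes, with no means of recognizing a stale one.

```python
from collections import deque


class ResourceMonitor:
    """Hypothetical monitor that restarts a resource on each failure notice."""

    def __init__(self):
        self.healthy = True
        self.restarts = 0
        self.pending = deque()  # failure notifications awaiting processing

    def notify_failure(self):
        self.pending.append("failure")

    def process_next(self):
        self.pending.popleft()
        # Nothing marks the notification as stale, so the resource is
        # restarted even if it has already recovered.
        self.restarts += 1
        self.healthy = True


def duplicate_failure_scenario():
    """One real failure reported twice; return the number of restarts."""
    monitor = ResourceMonitor()
    monitor.healthy = False   # the resource fails once
    monitor.notify_failure()  # first notification
    monitor.notify_failure()  # duplicate notification for the same failure
    monitor.process_next()    # restart: resource is healthy again
    monitor.process_next()    # delayed duplicate causes a needless restart
    return monitor.restarts


if __name__ == "__main__":
    print(duplicate_failure_scenario())
```

A single failure produces two restarts here, the second of which interrupts a resource that was already working.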