Clusters of servers, or nodes, are frequently used to deliver network services. The clusters manage the resources that provide those services. When a node in the cluster fails, the cluster must ensure that the service is stopped on the failed node before migrating the service to another node. One approach is for the cluster to issue a stop command to the failed node; however, this approach often requires cluster management to wait for a period of time, e.g., a “timeout,” to be certain that a failure has actually occurred. Upon expiration of the timeout period, node failure can be inferred, at which point cluster management can safely shut down, re-power, and restart the failed node so that it is eventually prepared to provide services. This process can require a significant amount of time.
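As an illustration of the timeout-based inference described above, the following is a minimal Python sketch. The class name `NodeMonitor`, the use of heartbeats as the liveness signal, and the 30-second timeout in the usage note are assumptions made for illustration; they are not taken from any particular cluster manager.

```python
from dataclasses import dataclass


@dataclass
class NodeMonitor:
    """Hypothetical per-node monitor that infers failure from silence."""

    timeout_s: float             # assumed timeout period, in seconds
    last_heartbeat: float = 0.0  # time of the most recent sign of life

    def record_heartbeat(self, now: float) -> None:
        # Any sign of life from the node resets the timeout window.
        self.last_heartbeat = now

    def failure_inferred(self, now: float) -> bool:
        # Failure is only inferred once the full timeout has elapsed,
        # which is the source of the delay discussed in the text.
        return now - self.last_heartbeat >= self.timeout_s
```

For example, with `NodeMonitor(timeout_s=30.0)` and a last heartbeat at time 100.0, `failure_inferred(120.0)` is false and `failure_inferred(130.0)` is true, so recovery of the node cannot begin until the full 30 seconds have passed.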
The foregoing situation is further complicated by the fact that certain resources running on a node may have dependencies that affect the order in which the resources must be shut down. The need to properly sequence the shutdown of the resources increases the overall time required to shut down all resources and thereby increases the amount of time needed before cluster management can infer that the resources have been successfully shut down.
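The effect of dependencies on shutdown time can be sketched as follows. This is a minimal Python example under stated assumptions: the resource names, the `depends_on` mapping, the per-resource stop timeouts, and the policy of stopping resources strictly one at a time are all hypothetical and serve only to illustrate the ordering constraint.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def stop_order(depends_on):
    """Return an order in which each resource is stopped before the
    resources it depends on (the reverse of a valid start order)."""
    # static_order() yields dependencies first, i.e. a start order.
    start_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(start_order))


def worst_case_stop_time(depends_on, stop_timeout_s):
    """Upper bound on shutdown time, assuming resources are stopped
    sequentially and each may use its full (assumed) stop timeout."""
    return sum(stop_timeout_s[r] for r in stop_order(depends_on))


# Hypothetical example: a web server depends on a file system and an
# IP address; the file system depends on a volume.
depends_on = {"web": {"fs", "ip"}, "fs": {"vol"}}
stop_timeout_s = {"web": 10, "fs": 20, "ip": 5, "vol": 15}
order = stop_order(depends_on)
total = worst_case_stop_time(depends_on, stop_timeout_s)
```

Under these assumptions, `web` is always stopped before `fs` and `ip`, `fs` is stopped before `vol`, and the sequential worst case is the sum of the individual timeouts (50 seconds here). A cluster manager that waits on a timeout to infer a completed shutdown would need to allow at least that long.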
The foregoing situation is still further complicated by the fact that services may be provided by clusters that are independent and geographically dispersed. Due to the nature of independent and geographically dispersed clusters, inference and/or confirmation of node failures at other geographical locations encounters problems similar to those described above. For example, in the case of a link failure at a site, confirmation of termination of resources at that site cannot be forwarded to another site. Therefore, waiting for a timeout period is a prudent approach for confirming termination of resources. In particular, a cluster at the other site can infer the failure of a node at the site with the link failure based on timeouts. In such cases, the site inferring the failure of a node must be able to assume that, N seconds after the other site has been detected as dead, all resources on that other site have been stopped. This takes a certain minimum amount of time, even after the failed node has been appropriately shut down.
In yet another scenario, for delivery of a particular service, a resource in one cluster may depend on another resource in another independent cluster. In this situation, the dependencies of resources in different independent and geographically dispersed nodes increase the amount of time required to accurately confirm proper shutdown of the failed nodes. Therefore, although the approach discussed above has been generally adequate for its intended purposes, it has not been entirely satisfactory in all respects.