Computer networks typically include many components (also called nodes) and many resources associated with those nodes. The term "clustering" refers to a particular kind of computer network system. A cluster is a group of computers (nodes) connected in a way that lets them work as a single, continuously available system. Improved availability of system resources is one of the greatest advantages of clustering. Most resources in a cluster system are supplied redundantly, thus making the resources more available overall. Redundant resources permit the cluster system to continue working whenever one or more components of a cluster, whether hardware or software, fail. When a component failure occurs, the system continues by switching to another, operational component.
In a cluster system (or any large computer network), failures of various resources and components are certain to occur over time. Many computer systems maintain a model of the system that includes a "dependency graph," which defines, for each resource, the other resources on which it depends. When a serious failure of a resource is detected, the network removes the resource from its model of available resources and makes the resource unavailable to nodes in the network.
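As an illustration only, the dependency graph and the removal of a failed resource described above might be sketched as follows. The class name `ResourceModel`, its methods, and the resource names are hypothetical and do not come from any particular system. Reporting the dependents of a removed resource is an added assumption; the text above states only that the failed resource itself is removed.

```python
from collections import defaultdict


class ResourceModel:
    """Toy model of a system's resources and their dependencies (hypothetical)."""

    def __init__(self):
        # depends_on[r] is the set of resources that r needs in order to work.
        self.depends_on = defaultdict(set)
        # Names of the resources currently considered available.
        self.available = set()

    def add_resource(self, name, depends_on=()):
        """Register a resource and the resources it directly depends on."""
        self.available.add(name)
        self.depends_on[name].update(depends_on)

    def dependents_of(self, name):
        """Return every resource that directly or indirectly depends on name."""
        found = set()
        stack = [name]
        while stack:
            current = stack.pop()
            for resource, deps in self.depends_on.items():
                if current in deps and resource not in found:
                    found.add(resource)
                    stack.append(resource)
        return found

    def remove_failed(self, name):
        """Drop a seriously failed resource from the available set.

        Returns the resources that depend on it (assumption: a caller may
        want to re-examine these; the model does not remove them itself).
        """
        self.available.discard(name)
        return self.dependents_of(name)


model = ResourceModel()
model.add_resource("disk")
model.add_resource("database", depends_on=["disk"])
model.add_resource("web_server", depends_on=["database"])

affected = model.remove_failed("disk")
```

In this sketch, removing `disk` leaves `database` and `web_server` in the available set but reports both as affected, because each depends on `disk` directly or indirectly.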
If, for example, a client node tries but is unable to access a World Wide Web server, a resource failure of the Web server has occurred. A failure diagnosis procedure then determines the root cause of the failure and whether the failure is serious enough to make the Web server unavailable to all other nodes in the network. Part of the problem with current failure diagnosis systems is due to the existence of "transient failures" in a network. For example, a World Wide Web server may be overloaded and occasionally fail to service requests for certain data. This qualifies as a transient failure, since the Web server is not permanently disabled, but only temporarily overloaded. It would not be a good solution to remove the Web server from service every time it fails to deliver data to a requesting node. On the other hand, if the Web server is consistently failing to deliver data, it is probably desirable to remove the Web server from the network.
Thus, a common problem in conventional failure diagnosis lies in deciding whether to take conservative or aggressive action when it is determined that a resource failure has occurred. If a too-conservative approach is used, some total resource failures are missed, reducing overall system efficiency and availability. In contrast, if a too-aggressive approach is used, partial or temporary resource failures are sometimes misdiagnosed as total failures, again reducing the efficiency and availability of the system.
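To make the trade-off concrete, the following is a minimal sketch of one possible way to separate occasional failures from consistent ones: track the most recent request outcomes for a resource and recommend removal only when the failure rate over a full window reaches a threshold. This is an illustrative assumption, not a method described above. The class name `FailureTracker` and the parameters `window` and `threshold` are hypothetical.

```python
from collections import deque


class FailureTracker:
    """Tracks recent request outcomes for one resource (hypothetical sketch)."""

    def __init__(self, window=10, threshold=0.8):
        # Only the most recent `window` outcomes are kept; older ones drop off.
        self.outcomes = deque(maxlen=window)
        # Fraction of failures in the window at which removal is recommended.
        self.threshold = threshold

    def record(self, success):
        """Record one request outcome: True for success, False for failure."""
        self.outcomes.append(bool(success))

    def failure_rate(self):
        """Return the fraction of recorded outcomes that were failures."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_remove(self):
        """Recommend removal only after a full window of mostly failed requests."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_rate() >= self.threshold


tracker = FailureTracker(window=5, threshold=0.8)

# An overloaded server that fails occasionally: treated as transient.
for ok in [True, False, True, True, False]:
    tracker.record(ok)
transient_verdict = tracker.should_remove()

# The same server failing every request: treated as a consistent failure.
for _ in range(5):
    tracker.record(False)
persistent_verdict = tracker.should_remove()
```

In this sketch, lowering `threshold` or shortening `window` makes the diagnosis more aggressive, so transient overloads are more likely to be treated as total failures. Raising them makes it more conservative, so a genuinely failed resource stays in service longer.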