One of the challenges in building distributed systems is to avoid situations where one part of a system remains blissfully ignorant of important failure conditions that are occurring elsewhere in the system. Applications running on nodes in the system rely on one another for an application state, such as a piece of data, a resource, a variable, an operating condition, etc. Therefore, ignorance of a failure in the system can result in both inaccurate behavior and an orphaned state. For example, consider Nodes A, B, and C in a distributed system. Applications running on Nodes B and C depend upon Node A for a particular application state, such as the current temperature T. If Node A fails, or a communication link between Nodes A and B or A and C fails, the application state is no longer valid. If Nodes B and C do not know that Node A failed, they assume their current value for T is valid. However, when the actual T changes, the applications on Nodes B and C using the invalid T will produce erroneous results. Accordingly, there is a need for a failure detection and notification service to inform nodes of failures in the system.
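The Node A, B, and C example can be illustrated with a minimal sketch. The class name `Subscriber`, its methods, and the temperature values are hypothetical and chosen only for illustration; they are not part of any described system. The sketch assumes Node A has failed after publishing T = 20.0, and that only Node B learns of the failure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Subscriber:
    """Hypothetical node (such as B or C) that caches state published by Node A."""
    name: str
    cached_t: Optional[float] = None
    valid: bool = False

    def receive(self, t: float) -> None:
        # Normal update from the publisher (Node A).
        self.cached_t = t
        self.valid = True

    def on_failure_notification(self) -> None:
        # The cached state can no longer be trusted once Node A,
        # or the link to it, is known to have failed.
        self.valid = False

    def read(self) -> Optional[float]:
        # Return the cached value only while it is believed to be valid.
        return self.cached_t if self.valid else None


def demo() -> Tuple[Optional[float], Optional[float]]:
    b = Subscriber("B")
    c = Subscriber("C")
    for node in (b, c):
        node.receive(20.0)  # last value published before Node A fails

    # Node A fails here. Only Node B is notified (assumption for the demo).
    b.on_failure_notification()

    # Node B refuses to use the stale value; Node C still returns 20.0
    # even if the actual temperature has since changed.
    return b.read(), c.read()
```

In this sketch, Node C continues to act on a value that may no longer reflect the actual temperature, which is the erroneous behavior a failure notification is meant to prevent.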
Failure detection in distributed computer systems is difficult. Foundational work on distributed systems showed that it is generally impossible to distinguish among a remote computer having crashed, a remote computer running very slowly, the network being down, and several other failure scenarios. Consequently, failure detection services cannot report all failures perfectly; they can report failures only under some circumstances.
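One way to see why these scenarios are indistinguishable is to consider a detector that relies on a timeout, which is an assumption made here for illustration and is not a mechanism described above. The function name `is_suspected` and the timing values are hypothetical. From the observer's side, a crashed computer and a merely slow computer present exactly the same evidence: silence.

```python
def is_suspected(last_heard_s: float, now_s: float, timeout_s: float) -> bool:
    """Return True when nothing has been heard from a peer within the timeout."""
    return (now_s - last_heard_s) > timeout_s


def observe_two_peers(now_s: float = 10.0, timeout_s: float = 5.0):
    # Peer 1 crashed at t=0 and will never reply again.
    crashed_last_heard_s = 0.0
    # Peer 2 is alive but slow; its next reply would arrive at t=12,
    # which the observer cannot know at t=10.
    slow_last_heard_s = 0.0

    # The observer sees identical inputs for both peers, so it must
    # reach the identical verdict for both.
    return (
        is_suspected(crashed_last_heard_s, now_s, timeout_s),
        is_suspected(slow_last_heard_s, now_s, timeout_s),
    )
```

Both peers are suspected in this sketch, although only one has actually failed. Lengthening the timeout reduces false suspicions but delays detection of real failures.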
Previous failure detection services have been used in distributed computing environments that attempt to achieve reliability and availability by running the same program on several computers in parallel. In these systems, every input is sent to all of the computers. In this context, which is sometimes referred to as “lock-step replication” or “virtual synchrony,” each of the several computers receives all of the inputs, does some computation, and (typically) sends some output back to the user. The user then aggregates the responses, perhaps by taking as definitive the response that appeared most often (if the responses happen to be non-identical). Thus, it is often necessary for each of the several computers to agree about the identity of all the other computers in the group. The role of the failure detection service is then to detect computers that have failed, and to propagate this information to all the members of the group. The failure detection service is generally tightly integrated with a group membership service; the group membership service is the local service each computer runs that is authoritative on the question of which computers are available to participate in the distributed computing environment (possibly as a result of new computers joining to replace computers that are believed to have failed). These failure detection services are generally not suitable for handling large numbers of machines simultaneously, and they generally provide reliable failure notification only so long as a reliable messaging substrate continues to operate.
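The aggregation step performed by the user, in which the response that appeared most often is taken as definitive, can be sketched as follows. The function name `aggregate_responses` is hypothetical, and the sketch assumes that responses are hashable values that can be compared for equality.

```python
from collections import Counter
from typing import Hashable, Sequence


def aggregate_responses(responses: Sequence[Hashable]) -> Hashable:
    """Return the response that appears most often among the replicas."""
    if not responses:
        raise ValueError("no responses were received")
    # most_common(1) yields a list holding the single (value, count)
    # pair with the highest count.
    value, _count = Counter(responses).most_common(1)[0]
    return value
```

For example, `aggregate_responses([42, 42, 41])` returns `42`. The sketch does not address ties or replicas that never respond, which is one reason such systems need agreement on which computers are currently members of the group.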
Another failure detection service seeks to ensure that most computers agree about which other computers are functioning in the face of some failures, but not all failures. For example, this service detects only computers that have become entirely unreachable, and does not detect communication failures that prevent only certain pairs of computers from communicating. Furthermore, this service does not support the establishment of multiple small groups, and requires all computers participating in the failure detection service to be aware of all other computers that are similarly participating.
There exists, therefore, a need in the art for a lightweight, distributed failure notification service that allows for the formation of failure notification groups, and guarantees that every computer in the failure notification group will be reliably notified of a system failure affecting the group.