As the use of computer networks in safety-critical systems increases (e.g., fly-by-wire commercial aircraft systems, automated train and subway operation systems, airline traffic control systems, medical systems, banking and stock market systems, etc.) the demand for fault-tolerant and highly-available distributed systems also rapidly increases. A crucial component of any such system is reliable communication. As such, there is a urgent need for a method and apparatus that can monitor network systems with process crashes, message losses, and network partitioning. Furthermore, such a method or apparatus could also be used to provide efficient solutions of several other core components of fault-tolerant distributed systems, such as Consensus, Atomic Broadcast, and Atomic Commitment.
Computer networks are often subject to failure. Typically these failures include crashes of processors and processes, message losses, and network partitioning. With such failures, achieving reliable communication is difficult and expensive. The following simple example illustrates the basic problem.
Suppose process x sends a message m to process y, but x does not receive the acknowledgment ack(m) indicating that y received m. Should x keep on trying to send copies of m to y, or should it give up? The answer depends on why x is not receiving ack(m) from y. It could be either because (a) y has crashed or is partitioned away from x, or because (b) the copies of m or ack(m) sent so far were lost due to transient link failures. In the first case x should not continue resending m (because it is useless and thus a waste of network resources), and in the second case, x should try to resend m again. The problem is that x cannot distinguish between the two cases.
To deal with this problem, existing communication protocols either use timeouts or limit the number of allowed retransmissions. In the first scheme, x keeps on resending copies of m until its timeout period expires, and in the second scheme, it resends m until it reaches the maximum allowed number of retransmissions. If processor x gets no acknowledgments by then, it gives up sending m to y. This type of solution is problematic for two reasons. First, the selection of an adequate timeout period, or of the maximum number of retransmissions, is difficult because it depends on many system parameters that may change over time. Second, x sends several copies of m toy even if y is actually crashed (or partitioned away from x)xe2x80x94a waste of network bandwidth.
The Heartbeat Failure Detector is a new, simple method and device that solves the above problem. This device has several identical modules, and each module is attached to a processor in the system. Roughly speaking, the module of a processor x maintains a heartbeat counter for every other process. For each process y, the counter of y at x (called the heartbeat of y at x) periodically increases while y is alive and in the same network partition, and the counter stops increasing after y crashes or becomes partitioned away. Using such a device, x can solve the communication problem above by resending m only if the heartbeat of y at x increases. Note that if y is crashed (or partitioned away from x), its heartbeat at x stops almost immediately, and so x will also immediately stop sending copies of m to y.
An important feature of the Heartbeat Failure Detector is that it can easily be implemented without using timeouts. The detailed specification shows how to implement the Heartbeat Failure Detector in several types of networks where processes may crash and links may lose messages. In particular, we consider networks where all communication links are bidirectional, networks where some links are unidirectional, and networks that may partition. The basic ideas behind these implementations are as follows. For networks that do not partition, processors (processes) periodically broadcast xe2x80x9cheartbeatxe2x80x9d messages: whenever a processor x receives a heartbeat message from a processor y, it increments the heartbeat counter of y at x. For networks that may partition, x increments the heartbeat counter of y each time x receives a heartbeat message that was sent by x and later relayed by y.
Note that the Heartbeat Failure Detector does not use timeouts on the heartbeats of a process in order to determine whether this process has failed or not. The Heartbeat Failure Detector just counts the total number of heartbeats received from each process, and outputs these xe2x80x9crawxe2x80x9d counters without any further processing or interpretation. Thus, the Heartbeat Failure Detector should not be confused with existing implementations of failure detectors (some of which, such as those in Ensemble and Phoenix, have modules that are also called heartbeat). Even though existing failure detectors are also based on the repeated sending of a periodic message, they use timeouts on these messages in order to derive lists of processes considered to be up or down; applications can only see these lists. In contrast, the Heartbeat Failure Detector simply counts heartbeats, and shows these counts to applications.
The Heartbeat Detector works with partitionable networks with process crashes and lossy links, and is applicable to the problems of reliable communication and consensus for such networks. For both problems we developed algorithms that are quiescent, i.e., algorithms that eventually stop sending messages. We first tackle the problem of reliable communication for partitionable networks by extending the results of simple point-to-point solutions. In particular, we generalize the specification of the heartbeat failure detector (hereinafter xe2x80x9cHBxe2x80x9d), show how to implement it, and show how to use it to achieve quiescent reliable communication. We also show how to use the HB for consensus for partitionable networks. We first show that, even though this problem can be solved using a natural extension of failure detectorxcx9cS, such solutions are not quiescentxe2x80x94in other words, S alone is not sufficient to achieve a quiescent consensus in partitionable networks. We then solve this problem usingxcx9cS and the quiescent reliable communication primitives that we developed for the HB.
For those skilled in the art who desire further proof of the efficacy of the invention, we provide an Appendix of lemma showing how the invention necessarily performs its disclosed operations.