This invention relates to interconnected computer systems, and more specifically to interconnected fault-tolerant computer systems which perform distributed processing.
Distributed computing is the design, programming and operation of a system of interconnected computers to achieve some common goal. In a stricter sense, distributed computing allows the processes performed by a given computing platform or set of platforms in the system to be performed in one or more other platforms in a coordinated fashion without affecting the computational results obtained by the processes. Fault-tolerant distributed computing provides for such a system to compensate reliably, accurately and in a timely way, and without manual intervention, for any of a specified range of failures which may occur during the pursuit of that goal.
The successful management of faults, errors and failures in such a computing environment is complex, but it is essential for any application requiring the cooperation of multiple computers on a real-time basis, especially where the application and the system must support or protect human life. Such situations include medical diagnostic and life-support systems, aircraft fly-by-wire systems, banking, finance and stock-trading systems, and spacecraft environment control and repair systems.
Most computer applications require only a single processor and memory, storage space connected exclusively to that processor, and a link to the Internet, with only one user in control. For such systems, no fault-tolerance exists; if the system encounters a hardware fault or a software error, it will either fail the faulty component, terminate the software in error, or crash the system. As any user of a desktop computer will attest, recovery is a manual affair, and may take a long time.
By contrast, fault-tolerant systems must continue acceptable operation under failure conditions. Fault-tolerant systems require some level of hardware redundancy, keeping reserve hardware and software processing components available for use in sufficient numbers and types to handle ongoing and anticipated workload whenever an operating component fails. Since a processing component may fail, multiple processing components are designed into fault-tolerant systems to allow uninterrupted completion of work in progress. The processing components cooperate in the distribution, execution and completion of tasks, and the assembly and distribution of their results.
Failure of a processing component in a fault-tolerant distributed system does not imply that the failed component stops all processing and relinquishes its workload. Some types of failure permit reboot and recovery of the failed component, allowing it to continue with active service. Under conditions of heavy load, this reactivation of a failed component may be essential for the system to continue to deliver its outputs as required.
Unfortunately, such reactivations may not correct the problem causing the failure in the first place, and a reactivated processing component may resume its work, deliver some results, and fail again, more than once. How does the system accommodate such erratic behavior? How can the system make a valid determination as to the state of its processing components, and act accordingly to complete its work in an acceptable manner?
The same question applies whether the processing components in question are hardware components, software components, or a combination or blend of the two types. For the purposes of this discussion, a process and a processor are considered to have similar, even identical, problems of behavior. The terms "process" and "processor" are used interchangeably here.
Numerous mechanisms have been constructed in hardware and software to prevent individual component failures from stopping an entire system or rendering its processing ineffective. A critical problem with many such mechanisms is that they cannot reliably detect and identify failures of other system components. A failure message may be lost. A failure-detection component may itself malfunction. A working component may be falsely identified by malfunctioning hardware or software as one that has failed.
Consensus
The best range of solutions to the problem of reliable failure detection falls under the heading of consensus: getting the active, reliable processes in a system to agree on some common decision value, such as whether to commit or abort a transaction, or whether a given component of the system has failed. Depending on the complexity of the failures the system is designed to handle, such consensus may require a high level of process redundancy. This requirement adds significant capital and operating cost to the system.
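As a minimal illustration of a common decision value (this is not the protocol claimed by the invention, and the function name is hypothetical), consider processes that each propose a value and decide by strict majority; a real consensus protocol must additionally tolerate crashes, lost messages, and disagreement about which processes have failed:

```python
# Illustrative only: majority decision over proposed values.
# A real consensus protocol also handles crashes and message loss.
from collections import Counter

def decide_by_majority(proposals):
    """Return the value proposed by a strict majority, or None if no majority exists."""
    counts = Counter(proposals)
    value, votes = counts.most_common(1)[0]
    return value if votes > len(proposals) // 2 else None
```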
The problem of solving consensus in asynchronous systems with unreliable failure detectors (i.e., failure detectors that make mistakes) was first investigated in [CT96, CHT96]. These works only considered systems where process crashes are permanent and links are reliable (i.e., they do not lose messages). In real systems, however, processes may recover after crashing and links may lose messages. The problem of solving consensus with failure detectors in such systems was first considered in [DFKM96, OGS97, HMR97].
Solving consensus in a system where processes may recover after crashing raises two new problems: one concerns the need for stable storage, and the other concerns the failure detection requirements.
First, regarding stable storage: when a process crashes, it loses all its local state, i.e., the memory of what is going on at the time of the crash. This "memory loss" severely limits the actions a process can take upon its recovery. One way proposed for dealing with this problem is to assume that parts of the local state are recorded into stable storage, and can be restored after each recovery. But since stable storage operations are slow and expensive, they must be avoided as much as possible. Is stable storage always necessary when solving consensus? If not, under which condition(s) can it be completely avoided?
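The stable-storage approach described above can be sketched as follows. This is a hedged illustration, not the invention's mechanism (which avoids stable storage entirely); the file layout and state fields are assumptions chosen for the example:

```python
# Hypothetical sketch: a process records selected local state to disk so that
# recovery after a crash can restore it. Field names are illustrative only.
import json
import os
import tempfile

def checkpoint(state, path):
    """Record state to stable storage atomically (write a temp file, then rename)."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # force the bytes onto stable storage before renaming
    os.replace(tmp, path)      # atomic on POSIX: readers see the old or new checkpoint

def recover(path):
    """Restore the last checkpointed state, or start fresh after total memory loss."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"round": 0, "estimate": None}   # crashed before the first checkpoint
```

The `fsync` and atomic-rename steps are what make such operations slow and expensive, which is precisely the cost the question above asks whether one can avoid.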
Second, regarding failure detection: in the crash-recovery model, a process may keep on crashing and recovering indefinitely. Such a process is called unstable. How should a failure detector view unstable processes? An unstable process may be as useless to an application as one that permanently crashes, and may in fact be disruptive to overall system operation. For example, an unstable process can be up just long enough to be considered operational by the failure detector, and then crash before "helping" the application; this up-down-up cycle could repeat indefinitely. It would be natural to require that a failure detector satisfy the following completeness property: Eventually every unstable process is permanently suspected.
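To see why this completeness property is hard to satisfy, consider a simple timeout-based detector (an illustration constructed for this discussion, not a mechanism from the cited works): it suspects a process when heartbeats stop and trusts it again when they resume, so an unstable process is un-suspected on every recovery and is never permanently suspected:

```python
# Illustrative timeout-based failure detector. Because trust is restored on
# every heartbeat, an unstable process that keeps recovering is repeatedly
# un-suspected, violating "eventually every unstable process is permanently
# suspected".
class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}   # process id -> time of most recent heartbeat

    def heartbeat(self, pid, now):
        """Record a heartbeat; the process is trusted again from this moment."""
        self.last_heartbeat[pid] = now

    def suspects(self, pid, now):
        """Suspect a process never heard from, or silent longer than the timeout."""
        last = self.last_heartbeat.get(pid)
        return last is None or now - last > self.timeout
```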
Implementing such a failure detector is difficult even in a perfectly synchronous system, that is, one in which the stages of operation are synchronized and timed, and expectations of completion may be set and met. The difficulty is due to the fact that, at any given point in time, no such implementation can predict the future behavior of a process that has crashed in the past but is currently "up". Will this crashed process continue to repeatedly crash and recover, or will it stop crashing?
The problem of solving consensus with failure detectors in systems where processes may recover from crashes was first addressed in [DFKM96], with crash recovery as a form of omission failure. More recently the problem was studied in [OGS97, HMR97]. In these three works, the question of whether stable storage is always necessary was not addressed, and all the algorithms used stable storage. In [DFKM96, OGS97] the entire state of the algorithm is recorded into stable storage at every state transition. In [HMR97], only a small part of the state is recorded, and writing to stable storage is done at most once per round. The algorithm in [DFKM96] is not designed to deal with unstable processes which may intermittently communicate with good ones. The algorithms in [OGS97, HMR97] use failure detectors that require that unstable processes be eventually suspected forever.
This last requirement has a serious drawback: it forces failure detector implementations to behave poorly even in perfectly synchronous systems. A good example is found in the class of synchronous, round-based systems having no message losses. A synchronous system always performs its intended function within a finite and known time bound; a round-based system uses multiple rounds of decision-making among active processes regarding other processes that are suspected of having failed. In such a system, up to a certain maximum number of processes may be unstable.
In such a system, every implementation of a failure detector with the above requirement will execute with the following undesirable behavior: there will inevitably occur a round of execution after which all processes are permanently up, but the failure detector incorrectly suspects the given maximum number of potentially-unstable processes forever. Significantly, these permanent mistakes are not due to the usual causes, namely, slow processes or message delays. They are entirely due to the requirement for suspecting unstable processes. This requirement involves predicting the future.
What are the consequences of requiring that unstable processes be suspected, meaning that they cannot be used? Consider the example of a set of processes receiving and storing telemetry data, in which certain infrequently-occurring streams of bad data can cause the receiving process to fail. In this case, the users of the data consider the receipt and storage of the data more important than the failure of a receiving process. If a prolonged burst or series of bursts of bad data is received, a failure detector obeying the above requirement may take one or more receiving processes out of service, incurring the loss of significant streams of irreplaceable information.
Clearly, if potentially-productive processes in a system are barred from performing their tasks in this manner, the system is wasting resources and possibly losing critical data. On the other hand, if faulty processes are allowed to continue operation without adequate restraint, the products of their execution may be worthless or disruptive, and the system as a whole may crash.
[CHT96] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685-722, July 1996.
[CT96] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225-267, March 1996.
[DFKM96] Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, N.Y., September 1996.
[HMR97] Michel Hurfin, Achour Mostefaoui, and Michel Raynal. Consensus in asynchronous systems where processes can crash and recover. Technical Report 1144, Institut de Recherche en Informatique et Systèmes Aléatoires, Université de Rennes, November 1997.
[OGS97] Rui Oliveira, Rachid Guerraoui, and Andre Schiper. Consensus in the crash-recover model. Technical Report 97-239, Département d'Informatique, École Polytechnique Fédérale, Lausanne, Switzerland, August 1997.
The invention comprises a protocol to provide consensus among processes in a distributed computing system where processes may crash and later recover. The invention also provides consensus among processes which may be disconnected and later reconnected to the network, as can often happen in a mobile environment, where messages can be lost. In support of its consensus protocol, the invention further comprises the augmenting of the output of process-failure detectors with epoch numbers, one per active process, in order to detect both permanent and transient crashes. The invention's protocol does not require the use of stable storage, thereby reducing system cost and complexity, and works even if more than a majority of the processes have crashed. The invention's failure detectors and their epoch numbers permit the proper handling of failed processes that suffer from frequent transient crashes. The invention does not remove such processes from service, and thereby improves performance, reliability, stability and productivity of the system. The invention further requires no mechanism for predicting future behavior of the system in which it operates.
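The epoch-number idea can be sketched as follows. This is a hedged reading of the summary above, not the claimed protocol itself; the data structures, method names, and the `crashed_since` helper are assumptions introduced for illustration:

```python
# Sketch of a failure detector whose output is augmented with epoch numbers,
# one per process: an epoch increase reveals that a process crashed and
# recovered (a transient crash), even if it is currently trusted again.
class EpochFailureDetector:
    def __init__(self):
        self.trusted = set()
        self.epoch = {}            # process id -> current epoch number

    def on_recovery(self, pid):
        """Called when pid is observed up again after a (possibly transient) crash."""
        self.epoch[pid] = self.epoch.get(pid, 0) + 1
        self.trusted.add(pid)

    def on_timeout(self, pid):
        """Called when pid appears to have crashed."""
        self.trusted.discard(pid)

    def output(self):
        """Detector output: the trusted set together with all epoch numbers."""
        return set(self.trusted), dict(self.epoch)

def crashed_since(old_epochs, new_epochs, pid):
    """True if pid crashed and recovered between two detector readings."""
    return new_epochs.get(pid, 0) > old_epochs.get(pid, 0)
```

Comparing epoch numbers across readings lets a consensus algorithm distinguish a process that stayed up from one that crashed and recovered in between, without predicting the process's future behavior and without permanently suspecting it.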
The invention's consensus mechanisms tolerate link failures and are particularly efficient in the runs that are most likely in practice: those with no failures or failure detector mistakes. In such runs, the invention achieves consensus within 3δ time and with 4n messages, where δ is the maximum message delay and n is the number of processes in the system.