1. Field of the Invention
The invention generally relates to the fault tolerance of a system of computer processors. More specifically, the invention relates to a system of computer processors that uses protocols which allow the system to continue to operate properly after a number of processors have failed by crashing (ceasing operation) and another number of processors have concurrently failed by acting arbitrarily (byzantine failure).
2. Description of the Prior Art
There are many computer system applications which require fault tolerant systems, i.e., computer systems that operate properly even after failure of one or more processors in the system. These applications, often performing critical monitoring and control functions, include: air traffic control systems, nuclear reactor control systems, telephone switching systems, aircraft and spacecraft control systems, financial funds-transfer systems, and "Wall Street" securities trading systems.
The prior art commonly uses multiple processors in these computer systems to provide a level of tolerance to failures. Often these processors perform the same function, i.e., the processors replicate one another. Multiple processor systems can tolerate the failure of one or more processors by relying on the remaining non-faulty processor(s) which replicate and perform the functions of the failed processor(s). For example, in an aircraft flight-control system, it may be necessary for a processor to acquire a signal from a sensor, use this signal to determine how to change the position of a flight-control surface (e.g., a rudder), and finally issue a signal to actually change the position. If only a single processor performed this function and if this processor were to fail, safe operation of the aircraft would be compromised. However, replicating these functions in multiple processors enhances aircraft safety because some number of non-faulty processors are likely to be available for proper control even if other processors fail.
Fault tolerant systems use computer programs called protocols to ensure that the system will operate properly even if individual processors fail. One fault tolerant computer system design uses many replicated processors and two types of protocols together: 1. broadcast protocols, i.e., fault tolerant ways of broadcasting a signal to all the processors in the system, and 2. consensus protocols, i.e., fault tolerant ways of reaching a consensus. In essence, all the non-faulty processors first determine identical values for system inputs by having the inputs disseminated by a broadcast protocol. Then all the processors perform whatever calculation is required on the inputs in order to individually propose an output action. Finally, all run a consensus protocol so that the non-faulty processors agree on a common output action.
With respect to the previous example of an aircraft flight-control system, a broadcast protocol could be used to ensure that each of the replicated processors in the computer system obtained the same sensor output as its input. The replicated processors could then perform whatever calculation was necessary, based on the sensor output, to determine a direction in which to move a flight-control surface. A consensus protocol could then be used so that all non-faulty processors agreed on the direction to move the flight-control surface.
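By way of illustration only, the three phases described above (broadcast of the input, replicated calculation, consensus on the output) might be sketched as follows. The function names, the threshold-based control law, and the use of a simple majority count in place of a real consensus protocol are all assumptions made for this sketch. No processor failures are modeled.

```python
from collections import Counter

def broadcast(sensor_value, processors):
    # Stand-in for a fault tolerant broadcast protocol: every processor
    # ends up with the same input value. Failures are not modeled here.
    return {p: sensor_value for p in processors}

def propose(sensor_value, threshold=0.0):
    # Hypothetical control law: choose a direction for the rudder.
    return "left" if sensor_value < threshold else "right"

def consensus(proposals):
    # Stand-in for a fault tolerant consensus protocol: take the most
    # common proposal. A real protocol must also cope with faulty senders.
    return Counter(proposals.values()).most_common(1)[0][0]

def control_step(sensor_value, processors):
    inputs = broadcast(sensor_value, processors)            # phase 1
    proposals = {p: propose(v) for p, v in inputs.items()}  # phase 2
    return consensus(proposals)                             # phase 3
```

Because every replica receives the same input in phase 1, every replica proposes the same action in phase 2, and phase 3 is trivial in the failure-free case. The protocols only become difficult once processors can fail.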
The ability of a broadcast or consensus protocol to tolerate processor failures depends on how many processors fail and on the mode of processor failure. Processors fail when they no longer properly perform the functions that they were designed to perform. There are two failure modes: crash failures and byzantine failures. A processor has a crash failure, the first mode, if the processor performs its design function up to some point in time and thereafter completely stops working. A processor has a byzantine failure, the second mode, if the processor continues to operate but is not properly performing its design function. The behavior of a byzantine processor is totally arbitrary and unconstrained. At different times a byzantine processor may: 1. perform its design function, 2. crash (perform nothing), or 3. work in an erroneous or arbitrary manner (perform but not properly), e.g., it renders a false result for a calculation.
A fault-tolerant broadcast protocol has as its purpose the reliable dissemination of a signal generated by one "broadcasting" processor (or sensor), i.e., a broadcaster, to a set of other processors even though some system processors have failed. Essentially, the non-faulty receiving processors of the system agree on, i.e., determine, what signal the broadcaster sent. Broadcast protocols are necessary because the broadcasting processor may send a signal to only a single processor at a time and the broadcaster could fail prior to having sent the signal to each desired processor. In the event of broadcaster failure, some processors in the system: 1. have determined the signal while others have not (in the case that the broadcaster fails by crashing), or 2. have determined different signal values (in the case that the broadcaster fails by acting arbitrarily). A fault-tolerant broadcast protocol ensures that all non-faulty processors eventually determine the identical signal value, and, in the case that the broadcasting processor is non-faulty, that the determined signal value is the one generated by the broadcaster.
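The two guarantees stated above can be expressed as a small check over the outcome of one broadcast. This is a sketch under assumptions: the function name, the dictionary representation, and the use of None for "no value determined" are hypothetical and are not part of any protocol described here.

```python
def broadcast_ok(determined, non_faulty, broadcaster_non_faulty, sent_value):
    """Check the outcome of one broadcast.

    determined: dict mapping processor -> value it determined
                (a missing entry means no value was determined).
    non_faulty: iterable of the non-faulty receiving processors.
    """
    values = {determined.get(p) for p in non_faulty}
    # Agreement: all non-faulty processors determined one identical value.
    agreement = len(values) <= 1
    # Validity: if the broadcaster is non-faulty, that value is the one sent.
    validity = (not broadcaster_non_faulty) or values <= {sent_value}
    return agreement and validity
```

For example, a broadcaster that crashes after reaching only one of two receivers leaves the values {7, None}, which fails the agreement check. A byzantine broadcaster that sends 7 to one receiver and 9 to another fails it as well.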
A fault-tolerant consensus protocol enables each processor to propose an action (via a signal) that is required to be coordinated with all other processors in the system. A fault-tolerant consensus protocol has as its purpose the reaching of a "consensus" on a common action (e.g., turning a switch off) to be taken by all non-faulty processors and ultimately the system. Consensus protocols are necessary because processors may send signals to only a single other processor at a time and a processor failure can cause two processors to disagree on the signal sent by a third failed processor. In spite of these difficulties, a fault-tolerant consensus protocol ensures that all non-faulty processors agree on a common action and that this action is one proposed by a non-faulty processor.
To reach consensus, consensus protocols first enable each processor to propose an action (via a signal) that is later to be coordinated by all the processors in the system. The system then goes through the steps of the consensus protocol. After completing the consensus protocol steps, the common action of the consensus is determined. For example, in a flight-control system, there may be several processors, each equipped with its own sensor, that perform a calculation determining whether the aircraft needs to be moved up or down. In marginal situations, some processors may propose that the craft move up while others propose that it move down. It is important that all non-faulty processors reach consensus on the direction and therefore act in concert in moving the craft.
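The requirements on a consensus protocol (all non-faulty processors agree on a common action, and that action is one proposed by a non-faulty processor) can likewise be written as a check on the outcome of one run. The names and data layout below are assumptions chosen for illustration.

```python
def consensus_ok(proposals, decisions, non_faulty):
    """Check the outcome of one consensus run.

    proposals: dict mapping processor -> action it proposed.
    decisions: dict mapping processor -> action it finally adopted.
    non_faulty: iterable of the non-faulty processors.
    """
    non_faulty = list(non_faulty)
    decided = {decisions.get(p) for p in non_faulty}
    # Agreement: every non-faulty processor adopted the same action.
    agreement = len(decided) == 1 and None not in decided
    # Validity: the action was proposed by some non-faulty processor.
    allowed = {proposals[p] for p in non_faulty}
    validity = decided <= allowed
    return agreement and validity
```

In the flight-control example, if non-faulty processors propose both "up" and "down", either outcome satisfies the check provided all of them adopt the same one. An action proposed only by a faulty processor does not.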
Prior art discloses that if computer systems using a broadcast and consensus protocol have enough processors, they can tolerate a number of failures solely in the crash mode. According to the prior art, for a system of n processors to tolerate up to t failures of the crash type, it is necessary that n>t, i.e., there must be at least one more processor in the system than there are crash-failed processors. In other words, if all the replicated processors in the system fail except one, the system will tolerate these failures because the single working processor can still perform the function of the system. For example, in order to tolerate up to 2 processor failures of the crash type, a computer system utilizing 3 processors may suffice. This is because the broadcast and consensus protocols are able to operate under such conditions and at least one non-faulty processor is always operational and available to undertake the necessary computation and output actions.
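The n>t condition for crash failures can be illustrated with a toy model in which crashed processors produce no output and every surviving processor computes correctly. The function names and the numbering of processors are assumptions made for this sketch. The broadcast and consensus steps are omitted.

```python
def surviving_outputs(n, crashed, compute):
    # Processors are numbered 0..n-1; `crashed` is the set that stopped.
    return [compute() for p in range(n) if p not in crashed]

def system_output(n, crashed, compute):
    outputs = surviving_outputs(n, crashed, compute)
    # With crash failures only, every survivor is correct, so a single
    # surviving output is enough. None means the whole system has failed.
    return outputs[0] if outputs else None
```

With n = 3 and two crashed processors, one survivor remains and the system still produces its output. With all three crashed, t equals n and no output is available.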
These systems and their protocols cannot tolerate even a single byzantine processor failure, whether or not it occurs concurrently with the crash failures.
Other prior art discloses computer systems, with a sufficient number of processors, that can tolerate a number of faulty processors which have failed solely in the byzantine mode. These systems require a given number of processors, n, in excess of three times the number of byzantine processor failures to be tolerated, i.e., for a system of n processors to tolerate up to t failures of the byzantine type, it is necessary that n>3t. For example, in order to tolerate up to 2 processor failures of the byzantine type, a computer system utilizing 7 processors may suffice. This is because the broadcast and consensus protocols are able to operate under such conditions and, if the output action is determined by having all processors reach consensus, an identical output action is performed by all non-faulty processors, whose number, which is at least 5 in this example, exceeds the number of byzantine processors, which is at most 2 in this example. Thus the majority of processors perform identical actions.
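The arithmetic in this example can be checked directly. The helper names below are hypothetical. The sketch only evaluates the n>3t condition and the resulting count of non-faulty processors, and does not implement any protocol.

```python
def min_processors_byzantine(t):
    # Smallest n satisfying n > 3t.
    return 3 * t + 1

def byzantine_margin(n, t):
    # Returns (bound satisfied?, worst-case non-faulty count, faulty count).
    return (n > 3 * t, n - t, t)
```

For t = 2 the smallest admissible system has 7 processors, leaving at least 5 non-faulty processors against at most 2 byzantine ones. A system of 6 processors does not satisfy the bound for t = 2.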
These systems and their protocols can tolerate up to t byzantine failures, some or all of which may in fact be crash failures, but these systems require more than 3t processors to operate.
3. Problems with the Prior Art
The limitations of the prior art leave designers of fault tolerant computer systems with a dilemma: systems tolerant of the larger but less common class of failures (byzantine) require more processors (and expense) than systems tolerant of the smaller but more common class of failures (crashes). A system designer can make a fault tolerant system that can only tolerate t crash failures (and no byzantine failures) by designing a system with a minimum of t+1 processors. While this system will tolerate the most common failures, i.e., crash failures, just one processor failing in the byzantine mode could cause a total system malfunction with catastrophic results. Alternatively, a designer could build a fault tolerant system which tolerates t byzantine failures with a minimum of 3t+1 processors in the system. In this case, the designer has likely added many more processors to the design to attain a system which tolerates the least probable processor failure. Costs for systems of this sort could be prohibitive, especially if each processor is a large computer system.
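The size of the gap between the two prior art designs follows from the minimums stated above (t+1 processors for crash-only tolerance, 3t+1 for byzantine tolerance). The short sketch below tabulates it. The function name and the tuple layout are assumptions for illustration.

```python
def comparison_rows(max_t):
    # Each row: (t, crash-only minimum, byzantine minimum, extra processors).
    rows = []
    for t in range(1, max_t + 1):
        crash_only = t + 1
        byzantine = 3 * t + 1
        rows.append((t, crash_only, byzantine, byzantine - crash_only))
    return rows
```

The extra cost of the byzantine-tolerant design is always 2t processors: 2 more for t = 1, 4 more for t = 2, 6 more for t = 3.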
Accordingly, there has been a long-felt need in the industry for a fault tolerant computer system design that can tolerate (is resilient to) concurrent crash and byzantine processor failures but that does not require a large number of processors. The inventors know of no prior art fault tolerant computer system that can tolerate both crash and byzantine failures with fewer than 3t+1 processors.