The present invention relates to a fault-tolerant multiprocessor system, and, in particular, to a multiprocessor system whose fault tolerance is based in hardware rather than software. The present invention makes use of Field-Programmable Gate Arrays ("FPGAs") to improve fault tolerance provided in hardware.
Computer systems can fail in any number of ways. The failure can come from a fault in the electronic hardware or a bug in the software. To ensure that the computer system continues to function in spite of an individual failure, such as the failure of an individual processor, one builds a fault-tolerant computer. Engineering fault tolerance into a computer generally requires that one replicate a processor or process with redundant components. That is, one has more than one component performing each function and a means, when a fault is detected, for locking the faulty component out of the process and, if necessary, shifting its function to another component.
Thus replicating processors is a straightforward method for contending with a range of system failures. Designers can add redundancy and implement fault tolerance with commercial, off-the-shelf processors, thereby avoiding the expense of designing fault tolerance into the processors themselves. Suppose we have two identical processors executing identical software. We can detect a fault with a bit-wise comparison of the redundant outputs.
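By way of illustration only, the bit-wise comparison of redundant outputs described above can be sketched in Python. The function names outputs_agree and mismatched_bits are hypothetical and chosen for this sketch; the outputs are assumed to be available as byte strings of the kind two identical processors running identical software would produce.

```python
def outputs_agree(out_a: bytes, out_b: bytes) -> bool:
    """Return True when two redundant outputs are bit-for-bit identical."""
    if len(out_a) != len(out_b):
        return False
    diff = 0
    for a, b in zip(out_a, out_b):
        diff |= a ^ b  # any differing bit leaves a 1 in diff
    return diff == 0


def mismatched_bits(out_a: bytes, out_b: bytes) -> int:
    """Count the bit positions at which two equal-length outputs differ."""
    return sum(bin(a ^ b).count("1") for a, b in zip(out_a, out_b))


if __name__ == "__main__":
    good = bytes([0x12, 0x34])
    bad = bytes([0x12, 0x35])  # one flipped bit, standing in for a hardware fault
    print(outputs_agree(good, good))
    print(outputs_agree(good, bad))
    print(mismatched_bits(good, bad))
```

With only two processors, such a comparison can reveal that a fault has occurred but not which processor is faulty.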
Suppose instead that the software fails. Software faults are design faults. To compensate for such faults, redundant routines are designed to be functionally equivalent but different in their instructions. Their outputs may thus each be correct even though they are not identical. Direct comparison of the outputs is therefore inconclusive. Instead one must consider allowable variations in their outputs. These variations are unique to each function, so resolving redundant outputs to produce a single, fault-free output is much more difficult for software faults than for hardware faults.
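As a minimal sketch of why direct comparison is inconclusive for differently designed routines, consider two hypothetical square-root routines, one using the library function and one using Newton's iteration. Both may be correct yet differ in their low-order bits, so acceptance is decided by an allowable variation. The routine names and the tolerance value are assumptions made for this example only; in practice the allowable variation is specific to each function.

```python
import math


def sqrt_library(x: float) -> float:
    """First version: relies on the library square root."""
    return math.sqrt(x)


def sqrt_newton(x: float, iterations: int = 20) -> float:
    """Second version: same function, different instructions (Newton's method)."""
    guess = x if x > 1.0 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def outputs_acceptable(out_a: float, out_b: float, rel_tol: float = 1e-9) -> bool:
    """Accept two outputs when they differ by no more than the allowed relative variation."""
    return math.isclose(out_a, out_b, rel_tol=rel_tol)


if __name__ == "__main__":
    a = sqrt_library(2.0)
    b = sqrt_newton(2.0)
    print(outputs_acceptable(a, b))      # compared within the allowed variation
    print(outputs_acceptable(1.0, 1.1))  # a disagreement far outside the allowed variation
```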
One can design the hardware so that it accelerates the remedying of software faults. Then, wherever the fault occurs in the computer system, it can be remedied in hardware, software, or both, thereby ensuring that the system is fault-tolerant, that is, that it continues to function without error in spite of the fault.
Many systems require fault tolerance only at certain times. And, even when it is required, the degree of fault tolerance can vary. Thus, instead of fixedly configuring processors for fault tolerance, one can develop flexible structures that maximize the use of processors. Where fault tolerance is not required, this flexibility can be translated into multiprocessing, where the available processors form either a single parallel machine or several parallel machines.
Up to now, fault tolerance has not been implemented satisfactorily in hardware. Current fault-tolerant digital computing systems based in hardware are designed with redundant modules, so that failure of a single module does not mean failure of the system. Such designs require unacceptable tradeoffs as fault tolerance is implemented. They carry an excessive overhead in the redundant modules that come into play only when a fault occurs. When the system exhibits no faults, the redundant modules do not contribute to its functioning.
Though prior-art hardware implementations may offer the fastest solutions, they are inflexible in their use of redundant resources. Thus current hardware implementations of fault tolerance are wasteful when applications do not require that each and every module in a system be reliable.
Prior-art software implementations add flexibility, but they introduce other limitations. Multiprocessors can configure their processors for fault-tolerant operation by distributing a "vote" among them. That is, each component offers its own solution, and the entire multiprocessor is structured so that a composite, or vote, of them all yields a correct result. In shared-bus multiprocessors, the serial nature of the bus impedes the voting process. Fully connecting the processors is a solution, but multiple connections complicate each processor's interface. In either case, however, when comparison and error detection for fault tolerance take place in software running on the processors themselves, either fault tolerance must be added internally to the processors or assumptions must be made that severely restrict the types of faults tolerated.
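The voting described above can be illustrated, under simplifying assumptions, by a strict-majority voter written in Python. The names majority_vote and dissenters are hypothetical, and the sketch assumes that each processor's output is available as a value that can be compared for exact equality; it does not model the bus or interconnection over which the outputs are exchanged.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of processors, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def dissenters(outputs: Sequence[Hashable], winner: Hashable) -> List[int]:
    """Return the indices of processors whose output disagrees with the voted result."""
    return [i for i, out in enumerate(outputs) if out != winner]


if __name__ == "__main__":
    outputs = [7, 7, 9]  # the third processor is assumed to be faulty
    result = majority_vote(outputs)
    print(result)
    print(dissenters(outputs, result))
```

When such a voter itself runs in software on the processors being checked, a fault in the processor executing the vote can corrupt the result, which is the limitation noted above.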
Compared with hardware, however, software offers lower performance, because the microprocessors that execute the software commands have fundamental limitations. A microprocessor is inherently serial, that is, it processes only one instruction at a time. A microprocessor's resources are limited, designed years in advance and fabricated into unchangeable silicon. A microprocessor can waste its resources, performing, e.g., only a single add per cycle while the rest of the logic circuitry sits idle, awaiting the result. Software implementations of fault tolerance may allow the most efficient use of redundant resources, but they do so only with considerable overhead.
The problem is threefold. For detecting hardware faults by output comparison of redundant computing modules, the underlying mechanism can be hardware or software based. Hardware-based mechanisms are fast, but the configuration of the modules is rigid. Software-based mechanisms permit flexible module configurations, but performance is slower. For detecting software faults among functionally redundant but differently designed software, the underlying mechanism must accommodate a multitude of programmer-created functions and allow for variations among the outputs of the redundant functions. Because of size and power constraints, this complexity has prohibited a hardware-based mechanism for detecting software faults. As a result, software-based mechanisms have been the general rule for detecting software faults, and the speed advantages of hardware-based mechanisms have not been realized.
Thus there exists a need for a hardware-based fault-tolerant digital computing system that overcomes the drawbacks of current systems while preserving the speed advantages of hardware-based over software-based mechanisms for fault tolerance.