I. Field of the Invention
This invention relates generally to an improved fault-tolerant digital computer architecture, particularly for applications where a very high degree of safety is required, e.g., an aircraft flight control computer wherein the safe operation of the aircraft depends upon continuous error-free computer operation for the entire period of flight. It is recognized that error-free operation requires the elimination, or containment, of both software and hardware faults; however, the scope of this invention is limited to an improvement in the fault tolerance of hardware only.
II. Discussion of the Prior Art
For the purpose of understanding the invention, it can be assumed that a malfunction of any single component in a conventional computer will result in an unsafe error. This is known as a series reliability model, where the probability of an unsafe error is, to a close approximation when the individual probabilities are small, the sum of the probabilities of malfunction of the individual components. This series reliability model is expressed in the observation that "a chain is only as strong as its weakest link", and a system corresponding to this model is typically referred to in the literature as a "single thread system". A significant body of art, known as fault-tolerant computer architecture, has developed from the recognition that the best efforts to build a reliable, single-thread system are totally inadequate for many applications.
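By way of illustration only, the series reliability model can be sketched numerically as follows. The sketch assumes that component malfunctions are statistically independent, and the three component probabilities are hypothetical values chosen for the example; neither assumption is taken from the art discussed here.

```python
def series_failure_probability(component_probs):
    """Exact probability that at least one component malfunctions,
    assuming the malfunctions are independent (an assumption of this sketch)."""
    survive = 1.0
    for p in component_probs:
        survive *= (1.0 - p)
    return 1.0 - survive


def series_failure_sum(component_probs):
    """Sum-of-probabilities approximation described in the text."""
    return sum(component_probs)


# Hypothetical per-mission malfunction probabilities for three components.
probs = [1e-4, 2e-4, 5e-5]
exact = series_failure_probability(probs)
approx = series_failure_sum(probs)
print(f"exact={exact:.9f} approx={approx:.9f}")
```

For small individual probabilities the sum slightly overstates the exact value, and the difference is on the order of the products of pairs of probabilities.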
Underlying all fault-tolerant architectures is the concept that the effects of isolated faults can be masked by replicating the elements of a computer and coupling them in a redundant arrangement such that results are determined solely by a subset of functioning elements, i.e., the system produces error-free results even though one or more elements are malfunctioning. This is a much more difficult task than merely braiding a strong rope from a set of individually weak strands. To achieve fault masking, it is necessary to systematically correct errors when they occur or, alternatively, to exclude the faulty element from participating in the generation of a result. Either action depends upon an automatic means of error detection coupled to control circuitry which either corrects or contains the fault. The problem of realizing a fault-tolerant system is further compounded by the question: "What checks the error checker and the control circuits?"
It is the goal of all fault-tolerant architectures to provide the greatest possible reliability improvement with the lowest possible degree of redundancy, since redundancy increases cost, power, and size. In some instances, the added redundancy actually undercuts the reliability improvement being sought. The reliability improvement can be directed toward improving the availability of the system (the percentage of time the system is available to do useful work) or the safety of the system (the probability that the system will perform error-free for a specified mission time). Although availability and safety are interrelated, this invention is directed to achieving a substantial improvement in safety with a lower degree of redundancy than has heretofore been disclosed in the prior art. The present invention is distinguished over the prior art in that error correction capability, which would improve availability, is sacrificed to achieve a higher degree of safety.
It is well known in the prior art to employ redundancy in the form of error checking bits to make memories fault tolerant. This technique employs a linear block code (also known as an n,k code) comprised of a set of n binary digits wherein a subset of k binary digits represents the message (or data) portion of the code and the remaining binary digits (n-k) represent redundant binary digits of the code which may be used for error detection and/or error correction. A specific instance of a given code is commonly called a "code vector". For example, for a 9,8 code (8 data bits and one error checking bit) there are 512 possible nine-bit vectors (2 raised to the ninth power), of which 256 (2 raised to the eighth power) are valid code vectors. A 9,8 code provides the simple parity check of an 8 bit word, which is capable of detecting any single bit error (indeed, any odd number of bits in error), but fails to detect any even number of bits in error and provides no capability to correct errors. As the number of error checking bits is increased, the capability of the code to detect and/or correct random errors improves. This is because, as the number of check bits increases, the fraction of all possible vectors which are valid code vectors decreases, thus increasing the probability that a given error will result in an invalid code vector and thus be detectable. The minimum Hamming distance of a given linear block code (which, for a linear code, equals the smallest Hamming weight of any nonzero code vector) is the measure of its error detecting capability, i.e., a code having a minimum distance d assures detection of any error affecting d-1 or fewer places (binary digits) of a code vector. When the number of places in error is d or more, there is the possibility that the error will transform the code vector into a different, but valid and therefore undetectable, code vector. The logical properties of the code generator, usually expressed in the form of a code matrix, determine the specific error detection and error correction capabilities of the code.
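By way of illustration only, the behavior of a 9,8 parity code can be sketched as follows. The choice of even parity, the placement of the check bit in the least significant position, and the function names are assumptions made for this example.

```python
def parity_encode(data_byte):
    """Append an even-parity check bit to an 8-bit value,
    giving a 9-bit code vector (check bit in the lowest position)."""
    data_byte &= 0xFF
    check = bin(data_byte).count("1") & 1
    return (data_byte << 1) | check


def parity_check(vector):
    """Return True when the 9-bit vector has an even number of ones,
    i.e., when no error is detected."""
    return bin(vector & 0x1FF).count("1") % 2 == 0


word = parity_encode(0b10110010)
single_error = word ^ 0b000000100   # one bit inverted
double_error = word ^ 0b000010100   # two bits inverted

print(parity_check(word))           # valid code vector
print(parity_check(single_error))   # single-bit error is detected
print(parity_check(double_error))   # double-bit error goes undetected

# Count how many of the 512 possible nine-bit vectors are valid code vectors.
valid_count = sum(parity_check(v) for v in range(512))
print(valid_count)
```

Exactly half of the nine-bit vectors pass the check, which is why an error that inverts an even number of bits lands on another valid code vector.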
For any linear block code, the set of errors which can be detected is larger than the set of errors which can be corrected. Further, error detection capability can be enhanced at the expense of error correction capability. A detailed discussion regarding the properties of linear block codes is provided by the text titled "Error Control Coding: Fundamentals and Applications", Shu Lin and Daniel J. Costello, Jr., Prentice-Hall.
Linear block codes are well suited for memory error management. When writing to the memory, a code generator may be used to generate the error checking bits based upon the data bits provided as inputs. The code vector (data bits plus the error checking bits) is written as a single word in memory. When reading from the memory, the code vector is presented to a "syndrome generator" which provides outputs that may be used to provide an indication of error or, alternatively, be used to correct certain errors. Well-chosen linear block codes having a minimum Hamming distance of 3 or more can provide a very low probability of undetected errors given the nature of memory device failure modes.
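By way of illustration only, the write and read paths described above can be sketched with a 7,4 Hamming code, which has a minimum distance of 3. The choice of this particular code, the bit layout (check bits at positions 1, 2 and 4), the dictionary standing in for a memory, and all function names are assumptions made for the example. The syndrome is used here for detection only.

```python
def encode(nibble):
    """Code generator: map 4 data bits to a 7-bit code vector (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                      # index 0 unused so indices match positions
    c[3], c[5], c[6], c[7] = d       # data bits
    c[1] = c[3] ^ c[5] ^ c[7]        # check bits
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]


def syndrome(code):
    """Syndrome generator: 0 for a valid code vector, nonzero otherwise."""
    c = [0] + list(code)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    return s1 | (s2 << 1) | (s4 << 2)


def flip(code, *positions):
    """Return a copy of code with the given 1-based positions inverted."""
    out = list(code)
    for p in positions:
        out[p - 1] ^= 1
    return out


memory = {}                          # hypothetical stand-in for a memory array


def write(addr, nibble):
    memory[addr] = encode(nibble)


def read(addr):
    """Return (data, error_detected) for the word stored at addr."""
    code = memory[addr]
    data = code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)
    return data, syndrome(code) != 0


write(0, 0b1011)
clean_read = read(0)
memory[0] = flip(memory[0], 5)       # simulate a single-bit memory fault
faulty_read = read(0)
print(clean_read, faulty_read)
```

With this code every one-bit or two-bit error yields a nonzero syndrome; some errors of three or more bits transform one valid code vector into another and so escape detection.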
Linear block codes are not generally a practical means of error management for the central processing unit (CPU) of a computer, since the error correcting capability of the code is lost for any arithmetic transformation of the data. One well-known prior art technique for CPU error management is "triple modular redundancy" (TMR). TMR employs three CPUs which execute the identical program, using identical data, in clock synchronism. A majority voting circuit determines the output result for each clock cycle. In the event that one result is in error, the voting logic selects the result from one of the two CPUs which agree. Although TMR provides an improvement by masking any single-point CPU failure, the voting circuit is itself susceptible to single-point failures.
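By way of illustration only, majority voting among three CPU results can be sketched as follows. The sketch uses a bitwise two-out-of-three vote, which yields the agreed value whenever two results are identical; this is one common way of forming a majority and is not asserted to be the voter of any particular prior art system. The CPU functions and the stuck-bit fault are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise two-out-of-three majority of three integer results."""
    return (a & b) | (a & c) | (b & c)


def tmr_step(cpus, operand):
    """Run the same operation on three CPUs and vote on the results."""
    results = [cpu(operand) for cpu in cpus]
    return majority_vote(*results)


def healthy(x):
    """Hypothetical fault-free computation."""
    return (x * 3 + 1) & 0xFFFF


def stuck_bit(x):
    """Hypothetical faulty CPU: output bit 3 stuck at 1."""
    return healthy(x) | 0x0008


masked = tmr_step([healthy, stuck_bit, healthy], 5)        # one fault is masked
unmasked = tmr_step([stuck_bit, stuck_bit, healthy], 5)    # two like faults win the vote
print(masked == healthy(5), unmasked == healthy(5))
```

The function `majority_vote` is itself unreplicated in this sketch, which mirrors the single-point susceptibility of the voting circuit noted above.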
Another prior art arrangement, which eliminates the single-point failure mode of a TMR voter, is known as pair-spare (or sometimes dual-dual) redundancy architecture. This requires a 4X replication of memory and CPU. All four CPUs run in clock synchronism, with two CPUs paired to form the active set and the other two CPUs paired to form the standby set. As with the TMR arrangement, all CPUs execute the identical program using identical data. The active pair has control of the system bus. If at any instant the results of the active pair do not compare, indicating a fault, control of the system bus is passed to the standby pair, which thus assumes an active status, while the faulty pair assumes an inactive status. The faulty pair is then shut down and is repaired at a convenient time. The pair-spare redundancy architecture has the potential to provide the degree of fault masking required for high safety applications; however, the 4X replication of memory imposes a relatively high cost for this level of safety, particularly for applications which are memory intensive.
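By way of illustration only, the compare-and-switch behavior of a pair-spare arrangement can be sketched as follows. The class name, the representation of each CPU as a function, and the decision to raise an exception when both pairs have miscompared are assumptions made for the example; memory replication and bus arbitration are not modeled.

```python
class PairSpare:
    """Four CPUs: pair 0 starts as the active pair, pair 1 as the standby pair."""

    def __init__(self, cpus):
        self.pairs = [cpus[0:2], cpus[2:4]]
        self.active = 0
        self.shut_down = set()

    def step(self, operand):
        """All CPUs run the same operation; the output comes from a pair that agrees."""
        results = [[cpu(operand) for cpu in pair] for pair in self.pairs]
        for _ in range(2):
            if self.active in self.shut_down:
                break
            first, second = results[self.active]
            if first == second:
                return first
            # Miscompare: shut down the faulty pair and pass control to the other pair.
            self.shut_down.add(self.active)
            self.active = 1 - self.active
        raise RuntimeError("no fault-free pair available")


def good(x):
    """Hypothetical fault-free computation."""
    return x + 1


def faulty(x):
    """Hypothetical faulty CPU: one output bit inverted."""
    return (x + 1) ^ 0x4


system = PairSpare([good, faulty, good, good])
output = system.step(41)
print(output, system.active, system.shut_down)
```

In this sketch a single faulty CPU causes the active pair to miscompare, after which the standby pair supplies the result and the first pair remains shut down.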