1. Technical Field
The present invention generally relates to fault-tolerant digital computing systems. More specifically, the present invention relates to a system for quickly recovering from transient multi-bit data failures within a run-time memory array.
2. Background Information
Some digital computing system applications require a high degree of safety. For example, an aircraft flight control computer depends upon continuous error-free computing operation for the entire period of flight in order to operate safely. It should be recognized that error-free operation requires the elimination or containment of faults within the digital computing system. For many aircraft applications, the probability of an undetected failure must be less than 10^-9 per flight hour. In addition to ever-increasing demands for reliability, flight control requires a fast computing system with increased throughput.
A malfunction of any single component in a conventional computing system will result in an unsafe error. This is known as a series reliability model, wherein the probability of an unsafe error is the sum of the probabilities of malfunction of the individual components. A system corresponding to this model is sometimes referred to as a “single thread system.” In prior art computing systems, a single-thread memory system complemented with an off-the-shelf error detection and correction linear block code has been utilized to attempt to meet required failure probability levels. However, such a method does not satisfy required safety levels or processing throughput requirements.
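The series reliability model described above can be illustrated with a short calculation. The following is a minimal sketch, not part of the described system: the component probabilities are hypothetical values chosen for illustration, and component failures are assumed to be independent. It compares the sum of the component probabilities with the exact probability that at least one component malfunctions; for small probabilities the two are nearly equal, and the sum is the slightly more conservative figure.

```python
import math

def series_failure_sum(component_probs):
    """Series model as stated in the text: sum of the component probabilities."""
    return sum(component_probs)

def series_failure_exact(component_probs):
    """Exact probability that at least one component malfunctions,
    assuming (for illustration) that malfunctions are independent."""
    return 1.0 - math.prod(1.0 - p for p in component_probs)

# Hypothetical per-flight-hour malfunction probabilities for three components.
probs = [1e-6, 2e-6, 5e-7]

approx = series_failure_sum(probs)
exact = series_failure_exact(probs)
print(f"sum of probabilities: {approx:.6e}")
print(f"exact (independent):  {exact:.6e}")
```

Because every component sits in series, adding components can only raise the total, which is why a single-thread system struggles to reach very low failure probabilities.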
It is the goal of fault tolerant computing systems to provide the greatest possible reliability with the most cost effective approach. In some instances, added redundancy actually undercuts the reliability improvements it is meant to provide. Reliability improvement can be directed toward improving the availability of the system, i.e., the percentage of time the system is available to do useful work, or the safety of the system, i.e., the probability that the system will perform error-free for a specified mission time. U.S. Pat. No. 5,086,429 to Gray, et al., issued Feb. 4, 1992 and presently assigned to the assignee hereof, shows a computing system in which error correction capability is sacrificed, decreasing availability, in order to achieve a higher degree of safety.
U.S. Pat. No. 5,086,429 describes a fail-operative, fail-passive, fault tolerant computing system, which includes a first and second pair of substantially identical processors connected to a system bus, with one pair being arbitrarily designated as the “active” pair while the other is designated as a “hot stand-by” pair. Each processor is operated in locked step fashion. Rather than providing individual memory arrays for each processor in each pair, the two processors in each pair share a common memory. A bus module examines the binary data and address transmissions carried by data buses and address buses for the active pair to determine whether a discrepancy exists in the information being simultaneously transferred over the address and data buses for that pair of processors. The standby pair is likewise configured.
Error detection logic, including a linear block code generator, operates during writes to memory by the processor so as to encode the datawords that are to be written to memory, creating a series of checkbits associated therewith. The datawords along with the checkbits are stored in the memory as a linear block codeword. During a read initiated by the processors, an appropriate codeword is addressed by the processors and read from the memory. The checkbits of the codeword are examined for correctness by a set of syndrome generators, one associated with each processor; the syndrome generators determining whether an error exists in the codeword read from memory. When such an error is detected, a signal is sent to bus monitor logic to cause a switchover such that the designated standby pair becomes the active pair. The faulted pair will record the fault and may either remain faulted, or in the case of a transient or soft fault become the stand-by pair.
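The write-path encoding and read-path syndrome check described above can be sketched in a few lines. This is an illustrative sketch only: it uses a small Hamming [7,4] code as a stand-in, since the text does not specify the linear block code actually used, and the function names `encode` and `syndrome` are hypothetical. Following the detect-only behavior described above, a nonzero syndrome is treated as a fault signal rather than being used for correction.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # 1-indexed codeword positions holding data bits
CHECK_POSITIONS = (1, 2, 4)     # 1-indexed codeword positions holding checkbits

def encode(data_bits):
    """Encode a 4-bit dataword (list of 0/1) into a 7-bit Hamming codeword."""
    code = [0] * 8  # index 0 unused so indices match 1-indexed positions
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    for p in CHECK_POSITIONS:
        # Each checkbit makes the parity over all positions whose index has bit p set even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    return code[1:]

def syndrome(codeword):
    """Return 0 for a valid codeword; otherwise a nonzero value."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s

# Write: encode the dataword and store the codeword.
stored = encode([1, 0, 1, 1])

# Read without error: syndrome is zero, no fault is signaled.
clean_fault = syndrome(stored) != 0

# Read with a transient single-bit upset at position 5.
corrupted = list(stored)
corrupted[4] ^= 1
upset_fault = syndrome(corrupted) != 0

print("fault on clean read:", clean_fault)
print("fault on corrupted read:", upset_fault)
```

In a system like the one described, the fault signal would drive the bus monitor logic that swaps the active and standby pairs; that logic is outside the scope of this sketch.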
It is well known in the prior art to employ a linear block code, also known as an [n,k] code, composed of a set of n binary digits wherein a subset of k binary digits represents the data portion of the code and the remaining n-k binary digits may be used for error detection and/or error correction. A specific instance of a given code is commonly called a “codeword.” For example, a [9,8] code (8 data bits and 1 error checkbit) has 256 valid codewords among the 512 possible 9-bit words. A [9,8] code provides a simple parity check of an 8-bit dataword, which is capable of detecting a single bit error but would fail to detect an even number of bits in error and provides no capability to correct errors. As the number of checkbits is increased, the capability of the code to detect and/or correct random errors improves, because the fraction of all possible words that are valid codewords decreases, thus increasing the probability that a given error will result in a detectable invalid codeword.
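The behavior of the simple parity code discussed above can be checked directly. The following is a minimal sketch, assuming even parity with the checkbit placed in the least significant position (both choices are assumptions made for illustration). It shows that a single-bit error is detected, that a two-bit error passes the check unnoticed, and it counts how many of the 512 possible 9-bit words satisfy the parity check.

```python
def parity_encode(data):
    """Append an even-parity checkbit to an 8-bit dataword, giving a 9-bit codeword."""
    if not 0 <= data < 256:
        raise ValueError("dataword must fit in 8 bits")
    checkbit = bin(data).count("1") % 2
    return (data << 1) | checkbit

def parity_ok(word):
    """A 9-bit word is a valid codeword when it contains an even number of ones."""
    return bin(word).count("1") % 2 == 0

codeword = parity_encode(0b10110010)

single_error = codeword ^ 0b000000100   # one bit flipped
double_error = codeword ^ 0b000010100   # two bits flipped

valid_words = [w for w in range(512) if parity_ok(w)]

print("codeword valid:       ", parity_ok(codeword))
print("single error detected:", not parity_ok(single_error))
print("double error detected:", not parity_ok(double_error))
print("valid 9-bit words:    ", len(valid_words))
```

Half of all 9-bit words are valid here, so a random multi-bit corruption has roughly an even chance of landing on another valid codeword; adding checkbits shrinks that fraction.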
The Hamming weight of a given linear block code is the measure of its error detecting capability, i.e., the Hamming weight is the maximum number of binary digits of a given dataword that may be in error while still assuring error detection by utilization of the linear block code. When the number of binary digits in error exceeds the Hamming weight, there is the possibility that the excess errors will transform the codeword into a different valid codeword, making the error undetectable. The logical properties of the linear block code generator, usually expressed in the form of a code matrix commonly referred to in the art as the H matrix, determine the specific error detection/error correction capabilities of the code.
U.S. Pat. No. 5,909,541 to Sampson, et al., issued Jun. 1, 1999 and presently assigned to the assignee hereof, shows a computing system utilizing linear block codes that corrects single bit data failures. U.S. Pat. No. 5,909,541 describes a computer system that combines the redundant memory arrays of a traditional two-lane locked step, fail-passive processing pair into a shared memory array. Each lane of the locked step system includes an error detection and correction module for detecting and/or correcting single bit errors. An error detection and correction optimized linear block code is leveraged over multiple datawords.
However, these prior art systems are unable to correct multi-bit data failures. As the geometry sizes of computer system components have been decreasing and the amount of memory has been increasing, the probability of a multi-bit data failure within a memory array has increased. In addition, those computer systems that operate at high altitudes, such as computer systems in air vehicles, are especially susceptible to single event upsets (SEUs). For example, an SEU can be triggered by secondary and tertiary particles generated from cosmic radiation, which can cause changes in the data leading to multi-bit data failures.
Thus there exists a need for a computing system utilizing linear block codes that is able to detect and correct transient multi-bit data failures and which meets ever increasing speed and reliability requirements with reduced redundancy and improved throughput.