1. Field of the Invention
This invention relates generally to a method and apparatus for detecting the occurrence of errors in digital computing systems, and more particularly to a system in which arithmetic operations are carried out simultaneously in two identical hardware modules and the results, which comprise a large number of signals, are compared using complementary residue codes to minimize the number of interface signals required.
2. Discussion of the Prior Art
With the ever-increasing complexity of digital computing systems, and the correspondingly increasing requirements for better system availability and lower mean time to repair, improved techniques for the detection and isolation of logical malfunctions or faults are essential. Although known software techniques have been successfully employed to provide a measure of fault detection and isolation, they suffer from two major drawbacks. The first is that the software program performing the fault detection may not detect the fault before the error has propagated to the point where critical data has been corrupted, thus making the process of recovery very difficult or, in some cases, impossible. The second drawback is that while software techniques can provide an indication that an error has occurred, they are generally ineffective in sufficiently localizing the fault so that a rapid maintenance measure, i.e., the substitution of a good circuit card or module for the failed unit, can be expeditiously accomplished.
A better solution to the detection and isolation of faults is to employ dedicated hardware fault-detection circuits which are capable of detecting and trapping a fault at or very near the instant at which it occurs. By doing this, not only is the corruption of data prevented, but also the machine state at the instant of the fault can be examined to aid in the diagnosis and isolation of the fault. Ideally, such fault detection hardware should add a minimum of additional circuitry to perform this function and at the same time should provide the highest possible probability of both fault detection and fault isolation.
A well-known technique which is effective for checking much of the logic employed in the digital computer is the use of parity. When a data word or operand is generated, its corresponding parity is generated and that parity passes, along with the data, through the various modules comprising the digital computer. At each juncture, the parity is again generated and compared to the parity transmitted. Thus, any single-bit error will result in a parity error and the precise location of that parity error will provide the required fault isolation. With more extensive fault detection logic, this approach can be extended to provide detection of multi-bit errors.
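By way of illustration only, the parity check described above can be sketched in software as follows. The function names, the 8-bit example word, and the choice of even parity are assumptions made for this sketch and are not taken from any particular machine.

```python
def parity(word: int) -> int:
    """Return 1 if the word contains an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def parity_ok(word: int, transmitted_parity: int) -> bool:
    """Regenerate the parity at a checking point and compare it
    with the parity bit that travelled with the data."""
    return parity(word) == transmitted_parity


# Illustrative 8-bit data word (assumed value) and its generated parity.
data = 0b10110010
p = parity(data)

# A single-bit error changes the parity and is therefore detected.
single_bit_error = data ^ (1 << 3)

# Two flipped bits cancel in simple parity, so this error escapes.
double_bit_error = data ^ 0b11

print(parity_ok(data, p))              # fault-free word
print(parity_ok(single_bit_error, p))  # single-bit error
print(parity_ok(double_bit_error, p))  # double-bit error
```

The double-bit case shows why simple parity alone does not cover multi-bit errors without more extensive detection logic.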
The simple technique of parity detection does not work, however, if parity is not preserved between the inputs and the outputs of a logical function. A prime example is the arithmetic logic unit (ALU) common to all digital computers, in which a data word (operand A) is combined with another data word (operand B) to provide the arithmetic result. The ALU performs many functions, including some, such as logical operations, which do not lend themselves to coding techniques for checking. There is no logical method of relating the parity of the arithmetic result to the parity of the two operand inputs, short of performing the same arithmetic operation. This suggests an approach for this class of logic in which an identical ALU is used to perform identical logical steps and the arithmetic results are compared at each step for equality. Other reasons for using duplicate checking are to achieve better error coverage than can be obtained using coding techniques and to enhance performance where code regeneration adds to data delay times. While the method of duplication/comparison assures 100 percent detection of any single-point failure, it imposes an unacceptably high hardware overhead for the error detection logic, in that a very large number of bits, in the range of 50 to 100, must be compared and, for each bit compared, an I/O pin must be provided on each of the arithmetic modules to allow the exchange of data for comparison purposes.
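The two points made above can be illustrated with a short sketch. It first shows two operand pairs having identical input parities but different result parities, and then models duplication/comparison of full-width results. The 16-bit word width, the function names, and the stuck-at-1 fault on output bit 4 are hypothetical choices made only for this example.

```python
WIDTH = 16                  # assumed word width for this sketch
MASK = (1 << WIDTH) - 1


def parity(word: int) -> int:
    """Return 1 if the word contains an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


# Both operand pairs have input parities (1, 1) ...
pair_1 = (1, 1)
pair_2 = (1, 2)
same_input_parities = (
    (parity(pair_1[0]), parity(pair_1[1]))
    == (parity(pair_2[0]), parity(pair_2[1]))
)
# ... yet the parities of the two sums differ (1 + 1 = 2, 1 + 2 = 3).
sum_parity_1 = parity(sum(pair_1))
sum_parity_2 = parity(sum(pair_2))


def good_alu(a: int, b: int) -> int:
    """Fault-free adder."""
    return (a + b) & MASK


def faulty_alu(a: int, b: int) -> int:
    """Adder with a hypothetical stuck-at-1 fault on output bit 4."""
    return ((a + b) & MASK) | (1 << 4)


def duplicate_and_compare(alu_x, alu_y, a: int, b: int):
    """Run the same operation on two units and compare every result bit."""
    result_x = alu_x(a, b)
    result_y = alu_y(a, b)
    return result_x, result_x == result_y


print(same_input_parities, sum_parity_1, sum_parity_2)
print(duplicate_and_compare(good_alu, good_alu, 3, 5))
print(duplicate_and_compare(good_alu, faulty_alu, 3, 5))
```

In hardware, the equality test in `duplicate_and_compare` corresponds to exchanging every result bit between the two modules, which is the source of the pin overhead noted above.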
Residue coding is another technique which can be used to check computer operations. It has the advantage of working well to check arithmetic operations, and its error detection coverage can be increased by choosing a larger modulus. In a duplication/comparison checking arrangement, a residue code can be generated from each of the two sets of bits to be compared, and only the residue codes need be compared, thus obtaining fairly high error coverage while requiring many fewer interface signals than a comparison of the original sets of bits. Dr. F. F. Paal describes a method of implementing residue code generators of modulus 2^a - 1 using a carry-save-adder tree and a full adder. This method allows such generators to be built with a reasonable amount of logic and to operate fast enough to be utilized in a comparison implementation. This technique will be described in further detail below.
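As a rough software analogue of residue generation for a modulus of the form 2^a - 1, the sketch below sums the a-bit groups of a word, relying on the fact that 2^a leaves a remainder of 1 when divided by 2^a - 1. It is not the carry-save-adder hardware itself; the function names, the choice a = 3, and the operand values are assumptions made only for illustration.

```python
def residue(word: int, a: int) -> int:
    """Residue of a non-negative integer modulo 2**a - 1.

    Because 2**a is congruent to 1 modulo 2**a - 1, the a-bit groups
    of the word can simply be added together, and the sum folded
    again until it fits in a bits.
    """
    m = (1 << a) - 1
    while word > m:
        total = 0
        while word:
            total += word & m   # add the lowest a-bit group
            word >>= a          # move on to the next group
        word = total
    # The all-ones pattern is the second representation of zero.
    return 0 if word == m else word


def residues_agree(result_x: int, result_y: int, a: int) -> bool:
    """Compare two results by exchanging only their a-bit residues."""
    return residue(result_x, a) == residue(result_y, a)


A = 3                        # assumed: modulus 2**3 - 1 = 7
x, y = 0x2F5A, 0x1C33        # arbitrary example operands
good_sum = x + y
bad_sum = good_sum ^ (1 << 9)   # hypothetical single-bit output error

# The residue of a sum can be predicted from the operand residues.
predicted = residue(residue(x, A) + residue(y, A), A)

print(predicted == residue(good_sum, A))
print(residues_agree(good_sum, good_sum, A))
print(residues_agree(good_sum, bad_sum, A))
```

With a = 3, only three signals per module need to be exchanged for the comparison, rather than one signal per result bit.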
The comparison of residue codes generated by two identical ALUs is always effective in detecting a single fault which results in a single erroneous output bit of the accumulator since, for this condition, the residue code generated will always be different from the residue code generated by the functional (non-error producing) arithmetic unit. Unfortunately, there is a very large class of faults which result in multiple output bits being in error. In this case, a malfunctioning ALU may generate the same residue code as the functional unit. In fact, the probability that this will happen for this class of faults is 1/m, where m is the modulus of the residue code. The resulting dilemma is that if m is chosen to be small, the probability of fault detection is poor, and if m is chosen to be large, the overhead in terms of comparison logic and input/output pins may become too high.
Algirdas Avizienis also describes the use of residue codes in error checking logic. In the Jet Propulsion Laboratory Technical Report No. 32-711 titled "A Study of the Effectiveness of Fault-Detecting Codes for Binary Arithmetic", which he authored, a mathematical analysis shows that the use of multiple check factors (residue codes) improves the effectiveness of fault detection. The mathematical relationship described in this report is utilized in the present invention.
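The 1/m figure can be checked numerically under a simplifying assumption, namely that a multiple-bit fault is equally likely to produce any wrong output value. The sketch below uses a hypothetical 12-bit result and the modulus m = 7; the names and values are chosen only for illustration.

```python
A = 3
M = (1 << A) - 1     # modulus m = 7 (assumed for this sketch)
WIDTH = 12           # assumed result width, small enough to enumerate


def all_single_bit_errors_detected(correct: int) -> bool:
    """A single flipped bit changes the value by a power of two,
    which is never a multiple of M, so the residue always changes."""
    return all(
        (correct ^ (1 << k)) % M != correct % M for k in range(WIDTH)
    )


def escape_fraction(correct: int) -> float:
    """Fraction of all wrong output values whose residue matches that
    of the correct value (assumes every wrong value is equally likely)."""
    wrong_values = [v for v in range(1 << WIDTH) if v != correct]
    escapes = sum(1 for v in wrong_values if v % M == correct % M)
    return escapes / len(wrong_values)


correct_result = 0x5A3   # arbitrary example value
print(all_single_bit_errors_detected(correct_result))
print(escape_fraction(correct_result), 1 / M)
```

Under this assumption the escape fraction is close to 1/7, which illustrates why a small modulus gives poor coverage for faults of this class.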