Modern technology has brought about many advancements in the design and implementation of computer processors. However, the possibility of errors arising in digital signals representing either data or control words is still problematic in all computer systems. An undetected error in either the processing control flow or the data, which may arise from a variety of fault sources, may result in propagation of erroneous data each time a further operation is performed on either the data or any data derived from the erroneous data. An error in a control word can result in rapid propagation of corrupted data and the corruption of good data when that data is processed with the erroneous control word. The many efforts made in recent years to minimize or contain the adverse effects of faults, as they are manifested through resultant errors, have drastically reduced the potentially devastating impact of errors on the integrity of computational results. However, error detection and recovery continue to be major concerns to computer system designers as designs are constantly being driven to higher standards of dependability, throughput, levels of integration, and computational complexity.
A variety of strategies and techniques have been proposed for error detection. Analyses to determine the optimal error detection technique must consider factors such as error detection latency and coverage. Strategies based on information redundancy, and the techniques for realizing them, yield designs with low error detection latency. The percentage of errors that are detectable, i.e., the error coverage, is often used to select the desired information redundancy technique. The techniques range from information encoding schemes, i.e., check codes, to a complete copy of the computer system, i.e., a redundant or duplicate system. Error check codes use additional information, i.e., bits that are an encoded representation of the original data or control sequence, to determine whether the data or control sequence has erroneously changed. Examples of error check codes include a parity code for a data word and a Cyclic Redundancy Code (CRC) for execution control sequences.
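By way of illustration only, the two check codes named above can be sketched in software as follows. The word width, the byte values, and the helper names `even_parity_bit` and `parity_ok` are assumptions made for this example, and the 32-bit CRC from Python's standard `zlib` module stands in for whatever CRC polynomial a particular design might use.

```python
import zlib


def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """Check a received word against its stored parity bit."""
    return even_parity_bit(word) == parity


# Parity code for a data word (hypothetical 8-bit value).
data = 0b10110010
stored_parity = even_parity_bit(data)

one_bit_error = data ^ 0b00001000   # a single bit flipped
two_bit_error = data ^ 0b00011000   # two bits flipped

single_flip_detected = not parity_ok(one_bit_error, stored_parity)
double_flip_detected = not parity_ok(two_bit_error, stored_parity)

# CRC over an execution control sequence (hypothetical control bytes).
control_sequence = bytes([0x10, 0x22, 0x35, 0x47])
stored_crc = zlib.crc32(control_sequence)

tampered_sequence = bytes([0x10, 0x22, 0x34, 0x47])  # one bit changed
crc_detects = zlib.crc32(tampered_sequence) != stored_crc
```

In this sketch the single-bit error is detected by the parity code while the two-bit error is not, which illustrates why error coverage is a selection criterion. The CRC detects the single-bit change in the control sequence.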
If check codes are utilized, each operation must be performed so that the check code remains valid after the operation. With arithmetic logic, for example, the operation may be carried out in a different number system such as with the residue number scheme, a detailed discussion of which can be found in Avizienis, A. A., "Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital Design," IEEE Trans. Comp., Vol. C-20, No. 11, November 1971, pp. 1322-1331. However, the use of a different number system involves an initial conversion to that number system and a subsequent conversion back from that number system after the operation is completed. Accordingly, this method of error detection may significantly reduce the performance of the data processor.
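The requirement that the check code remain valid across an arithmetic operation can be sketched with a simple residue check. This is a simplified illustration rather than the full residue number scheme discussed above: the modulus `CHECK_MODULUS` of 3 and the helpers `encode`, `checked_add`, and `is_valid` are assumptions made for the example.

```python
from typing import Tuple

CHECK_MODULUS = 3  # hypothetical check modulus chosen for this sketch

CodedWord = Tuple[int, int]  # (value, residue of value)


def encode(value: int) -> CodedWord:
    """Conversion step: attach the residue to the operand."""
    return value, value % CHECK_MODULUS


def checked_add(a: CodedWord, b: CodedWord) -> CodedWord:
    """Add the values and, independently, add the residues."""
    return a[0] + b[0], (a[1] + b[1]) % CHECK_MODULUS


def is_valid(word: CodedWord) -> bool:
    """The code is valid when the value still matches its residue."""
    return word[0] % CHECK_MODULUS == word[1]


total = checked_add(encode(17), encode(25))

# Inject a single-bit fault into the value part of the result.
faulty_word = (total[0] ^ 0b1, total[1])

fault_free_valid = is_valid(total)
fault_detected = not is_valid(faulty_word)
```

The `encode` step and the final `is_valid` check are extra work surrounding every operation, which corresponds to the conversion overhead noted above.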
The use of redundant, or duplicate, circuitry to check for errors has long been recognized as a highly effective error checking technique. The redundant circuitry approach essentially comprises two processors, a primary processor and a redundant processor, which are similarly connected to receive identical addresses, data, control signals and instructions. The primary processor, referred to as the master processor, provides normal processing and control. The redundant processor, referred to as the checker processor, runs in parallel with the master processor. If the system is operating properly, the master and checker processors operate in lock step and the results determined by the two processors should be identical. Otherwise, an error has occurred in the system. This approach has the advantage that the checker processor is identical to the master processor, and therefore can be used as a spare resource in the event that the master processor fails or becomes faulty. This approach, however, requires twice as much hardware as a single processor, though it has a smaller impact on performance than the check code approach discussed above. The master and checker processors typically run in parallel, and only processor outputs are used for error detection. Thus, the impact on internal processor throughput may be essentially eliminated.
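The lock-step comparison described above can be modeled in a few lines. This is a software sketch only: the function `alu` is a hypothetical stand-in for a processor, and `make_faulty` and `first_mismatch` are names introduced here for illustration.

```python
from typing import Callable, Iterable, Optional

Unit = Callable[[int], int]


def alu(x: int) -> int:
    """Hypothetical processing unit; master and checker are identical."""
    return (x * 3 + 1) & 0xFFFF


def make_faulty(unit: Unit, fault_step: int, mask: int = 0x1) -> Unit:
    """Wrap a unit so that its output is corrupted on one chosen step."""
    calls = 0

    def faulty(x: int) -> int:
        nonlocal calls
        result = unit(x)
        if calls == fault_step:
            result ^= mask  # transient fault flips an output bit
        calls += 1
        return result

    return faulty


def first_mismatch(master: Unit, checker: Unit,
                   inputs: Iterable[int]) -> Optional[int]:
    """Feed identical inputs to both units and compare outputs each step."""
    for step, x in enumerate(inputs):
        if master(x) != checker(x):
            return step  # an error has occurred in the system
    return None


inputs = [5, 9, 12, 40]
healthy_result = first_mismatch(alu, alu, inputs)
faulty_result = first_mismatch(make_faulty(alu, fault_step=2), alu, inputs)
```

With identical units no mismatch is reported, and with a fault injected on the third input the comparison flags step 2. Only the outputs are compared, so neither unit is slowed by the check.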
Further, the redundant circuitry (also referred to as master/checker) approach to fault detection requires that the master's data be visible to the checker. However, current trends toward increased integration on a chip and the associated computational complexity have decreased visibility to internal operations. The result of an erroneous operation that changes only an internal state (e.g., registers, caches, etc.) that is not visible to the checker may not be detected for a relatively long time. The error, in such a case, may only become visible when the master's state data is made visible to the checker, or when another internal operation uses the state data in a manner that makes it visible. This can result in exceptionally long error detection latencies. Any effort to reduce the error detection latency by making the output of each of the master's operations visible to the checker is typically not practical because of pin limitations and the adverse impact on performance. A processor's internal bandwidth, i.e., its processing throughput, is typically much greater than the external bandwidth. Input/output operations are relatively slow, and therefore it is generally considered too costly to make all the master's output data visible for checking.
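The dependence of detection latency on visibility can be illustrated with a toy model. Everything in this sketch is an assumption made for the example: the `Core` class with a single internal accumulator, the two operations `add` (internal only) and `store` (drives the accumulator to the outputs), and the `detection_latency` helper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Program = List[Tuple[str, int]]


@dataclass
class Core:
    acc: int = 0  # internal state, not visible to the other core

    def step(self, op: str, value: int, fault_mask: int = 0) -> Optional[int]:
        if op == "add":
            # Internal operation; a nonzero fault_mask corrupts the result.
            self.acc = (self.acc + value) ^ fault_mask
            return None  # nothing is driven to the outputs
        if op == "store":
            return self.acc  # internal state becomes externally visible
        raise ValueError(f"unknown operation: {op}")


def detection_latency(program: Program, fault_step: int,
                      fault_mask: int = 0x4) -> Optional[int]:
    """Steps between a fault in the master and the first output mismatch."""
    master, checker = Core(), Core()
    for step, (op, value) in enumerate(program):
        mask = fault_mask if step == fault_step else 0
        if master.step(op, value, mask) != checker.step(op, value):
            return step - fault_step
    return None  # the error never became visible


# Outputs driven rarely: five internal adds, then one store.
sparse_program = [("add", 1)] * 5 + [("store", 0)]
# Outputs driven after every internal operation.
dense_program = [("add", 1), ("store", 0)] * 3

sparse_latency = detection_latency(sparse_program, fault_step=1)
dense_latency = detection_latency(dense_program, fault_step=0)
```

In this model the latency is four steps when outputs are rare and one step when every result is stored. The second program, however, doubles the number of output operations, which reflects the pin and bandwidth cost described above.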
Therefore, a heretofore unresolved need existed in the industry for an error detection system and method that provides improved detection in a master/checker system with minimal error detection hardware overhead, minimal error detection latency, and minimal adverse impact on performance.