In the field of digital computers, fault tolerance is an important consideration wherever a computer is to be used in an environment in which access for maintenance may not be possible, whether because of circumstances or because of location. Thus, fault tolerance is a major factor in space and military applications. In a fault tolerant computer system, the computers, their components, the programs running in them, and the peripherals attached thereto are all provided with back-up capabilities so that any particular entity's functions can be assumed by another entity in the event of failure from any source or for any reason. This typically imposes a high overhead on the system as well as additional complexity and cost, but it is necessary in many instances, the alternative being complete system failure. As can be appreciated, this alternative is not acceptable for a multimillion dollar space probe or the like. While non-critical functions may have to be eliminated and overall performance may degrade, the functions essential to the success of the mission must be maintained.
A particularly difficult portion of a digital computer to check and correct for errors is the memory itself. The systems and applications programs that operate in the system, as well as the data they manipulate and produce, are contained in random access memory (RAM). Diagnostic programs in the RAM can detect and correct or bypass other system defects. If a CPU fails to provide a proper response to a diagnostic input, its functions can be transferred to another CPU. But how do we know that the memory is working properly, i.e., that it is reading and writing binary information without losing or picking up bits? This is an area of great concern to those working in the design of fault tolerant computers and their memories.
In many applications of fault-tolerant computer systems, such as deep space exploration or earth-orbiting satellites, a relatively long time may transpire between the occurrence and the detection of a fault. A fault that has occurred but has not yet generated an error is referred to as a "dormant fault", and an error that has been generated by the dormant fault but has not yet been detected by error checking circuitry is called a "latent error". If dormant faults and latent errors are not detected and corrected promptly after they occur, multiple faults or errors can accumulate. This can jeopardize the fault recovery mechanisms in most fault-tolerant systems, since they are designed to cope only with single faults. It should be noted that the effect of latent faults has been studied extensively by those skilled in the art.
It is known that classical error-detection techniques such as duplication-and-comparison, voting, error-detecting and correcting codes, and self-checking logic are not capable of detecting dormant faults and latent errors. This is because these techniques cannot detect a fault unless the faulty circuit is exercised in such a way as to cause a logic error to appear at a checking circuit. In normal system operation, however, the input required to exercise (i.e., trigger) the faulty circuit may not occur for a relatively long time, or may not occur at all. One way to detect these faults is to suspend system operation and check all data and components. This approach, of course, causes prolonged interruption of normal system operation and is therefore unsuitable for many applications, such as real-time systems. Other approaches to alleviating the dormant fault problem are to increase resiliency against multiple faults by increasing redundancy (e.g., by using a 3-out-of-5 system) or by employing multiple-error-correcting codes. Unfortunately, these techniques require large hardware overhead and do not solve the fundamental problem of exposing these error and fault conditions quickly.
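The distinction between a dormant fault, a latent error, and a detected error can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the toy `ParityMemory` class, the single even-parity bit per word, and the stuck-at-one fault model are hypothetical and are not taken from any particular prior art system. The sketch shows that the parity check reports nothing until the faulty cell is both written with conflicting data and subsequently read.

```python
def parity(word: int) -> int:
    """Return the even-parity bit (0 or 1) of a non-negative integer word."""
    return bin(word).count("1") & 1


class ParityMemory:
    """Toy word memory with one parity bit per word (illustrative model only)."""

    def __init__(self, size: int) -> None:
        self.cells = [0] * size          # stored data words
        self.parity_bits = [0] * size    # parity computed from the intended data
        self.stuck_at_one = [0] * size   # per-word mask of bits forced to 1

    def inject_stuck_at_one(self, addr: int, bit: int) -> None:
        """Model a physical fault: the given bit of the word now always reads 1."""
        self.stuck_at_one[addr] |= 1 << bit
        self.cells[addr] |= self.stuck_at_one[addr]

    def write(self, addr: int, word: int) -> None:
        """Store a word; faulty bits silently override the intended value."""
        self.parity_bits[addr] = parity(word)
        self.cells[addr] = word | self.stuck_at_one[addr]

    def read(self, addr: int):
        """Return (word, parity_ok). The check runs only when a read happens."""
        word = self.cells[addr]
        return word, parity(word) == self.parity_bits[addr]


mem = ParityMemory(8)
mem.write(3, 0b0100)
mem.inject_stuck_at_one(3, 2)   # bit 2 already holds 1: the fault is dormant
print(mem.read(3))              # parity still agrees, nothing is detected

mem.write(3, 0b0000)            # stored word is now wrong: a latent error exists
print(mem.read(3))              # only this read exercises the checker
```

In this sketch the error remains latent for as long as address 3 is not read, which mirrors the situation described above in which the triggering input may not occur for a long time.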
Prior art Self-Exercising (SE) techniques of the inventors herein can detect the presence of dormant faults and latent errors shortly after their occurrence while normal system operation is in progress. These techniques first enhance the testability of major system components (memory, data path, control circuitry, etc.) in a fault-tolerant system by augmenting their internal logic structure. Then, test cycles to detect faults in these components are interleaved with normal system operations. Each test cycle is a small portion of the complete test of the components. Hence, these test cycles are very short and can be applied at a relatively high rate (e.g., once every 100 μsec) without causing observable interruption to normal system operation. Since the components are designed to be highly testable, a complete test requires only a small number of test cycles (e.g., approximately 100 for systems that are not large). Thus, in a self-exercising system the maximum error latency, which is by definition the time required to perform the complete test, is also small. Self-exercising design has many applications, especially in environments where a high transient fault rate is expected, such as planetary exploration and some military applications.
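The interleaving of short test cycles with normal operation, and the resulting bound on error latency, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a description of the actual self-exercising hardware: the `SelfExerciser` class, the `run_interleaved` helper, the representation of each test cycle as a callable returning a pass/fail flag, and the choice of one test cycle per five normal operations are all hypothetical. Only the example figures of roughly 100 test cycles at one cycle every 100 μsec come from the description above.

```python
from typing import Callable, List


class SelfExerciser:
    """Runs one small slice of the complete test per test cycle, round-robin."""

    def __init__(self, checks: List[Callable[[], bool]]) -> None:
        self.checks = checks              # each check returns True when it passes
        self.next_check = 0               # index of the slice to run next
        self.failed: List[int] = []       # indices of slices that reported a fault

    def test_cycle(self) -> None:
        """Execute exactly one slice of the complete test, then advance."""
        idx = self.next_check
        if not self.checks[idx]():
            self.failed.append(idx)
        self.next_check = (idx + 1) % len(self.checks)


def run_interleaved(normal_ops: List[Callable[[], None]],
                    exerciser: SelfExerciser,
                    ops_per_test_cycle: int) -> None:
    """Perform normal operations, inserting one test cycle every N operations."""
    for count, op in enumerate(normal_ops, start=1):
        op()
        if count % ops_per_test_cycle == 0:
            exerciser.test_cycle()


def max_error_latency_us(test_cycles: int, cycle_period_us: int) -> int:
    """Time to complete one full test: number of cycles times cycle period."""
    return test_cycles * cycle_period_us


# Hypothetical system: 100 test slices, of which slice 42 hides a latent error.
checks = [lambda k=k: k != 42 for k in range(100)]
exerciser = SelfExerciser(checks)
run_interleaved([lambda: None] * 500, exerciser, ops_per_test_cycle=5)
print(exerciser.failed)

# With the example figures above: 100 cycles x 100 microseconds = 10 ms.
print(max_error_latency_us(100, 100))
```

Because every slice is visited once per pass through the list of checks, a latent error in any slice is exposed within one complete pass, which is the maximum error latency computed by the last function.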
While the above-described self-exercising technique has the advantages described, it also has certain drawbacks. First, although normal system operation is not interrupted, fault detection by self-exercising does cause a performance degradation of a few percent. Second, isolation of a latent error after it is detected requires fairly lengthy procedures. The self-exercising techniques have to suspend system operation in order to isolate or locate the error. This may not be acceptable if system operation is time critical. Moreover, if the transient fault arrival rate is high, significant performance may be lost due to fault isolation. Third, when multiple latent errors occur, the above self-exercising techniques are either unable to isolate each individual error or fail to detect the occurrence of multiple errors entirely. Thus, the probability of survival of such systems would decrease rapidly if the transient fault arrival rate were very high.
What is required, therefore, is a memory system design which can detect latent errors instantly without the need for an explicit test, so that the isolation of detected errors can be done simultaneously with normal operations. Furthermore, it should detect, isolate, and thus correct most multiple latent errors.