The invention relates generally to fault-tolerant processing systems using at least a pair of lock-step processors for error-checking, and more particularly to a method, and apparatus implementing that method, of passing dissimilar information between the lock-stepped processors. Among the important aspects of fault-tolerant architecture are (1) the ability to tolerate a failure of a component and continue operating, and (2) the ability to maintain data integrity in the face of a fault or failure. The first aspect often sees employment of redundant circuit paths in a system so that a failure of one path will not halt operation of the system. Both aspects may use self-checking circuitry, which often involves using substantially identical modules that receive the same inputs to produce the same outputs, and those outputs are compared. If the comparison sees a mismatch, both modules are halted in order to prevent a spread of possibly corrupt data. Examples of self-checking may be found in U.S. Pat. Nos. 4,176,258, 4,723,245, 4,541,094, and 4,843,608.
One particularly strong form of self-checking error detection is the use of processor pairs (and some of the associated circuitry) operating in "lockstep" to execute an identical or substantially identical instruction stream. The term lockstep refers to the fact that the two processors execute identical instruction sequences, instruction-by-instruction. According to this technique, often referred to as a "duplicate and compare" technique, the processor pair receives the same input information to produce the same results. Those results are compared to determine if one or the other encountered an error or developed a fault. The strength of this type of error detection stems from the fact that it is extremely improbable that both processors will make identical mistakes at exactly the same time.
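By way of illustration only, the duplicate and compare technique may be sketched in software as follows. The names run_lockstep and Divergence are hypothetical and do not appear in the description above; the two callables merely stand in for the paired processors, and the comparison that would be done by hardware is modeled as an equality test.

```python
class Divergence(Exception):
    """Raised when the paired results differ (hypothetical name)."""


def run_lockstep(step_a, step_b, inputs):
    """Feed the same inputs to both units and compare every result.

    step_a and step_b stand in for the two lockstep processors.
    Execution stops at the first mismatch, before the suspect
    result is passed on to anything else.
    """
    results = []
    for index, value in enumerate(inputs):
        out_a = step_a(value)
        out_b = step_b(value)
        if out_a != out_b:
            raise Divergence(f"mismatch at step {index}: {out_a!r} != {out_b!r}")
        results.append(out_a)
    return results
```

A matched pair yields the common result for every input; a mismatch raises the exception at the first differing step, which corresponds to halting both modules.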
Fault-tolerant designs often also use some form of error correction code to protect the main memory of a processor, providing the processor the ability to take a fail-fast approach. That is, when the processor detects an error, it simply stops. Recovery from such an error stop is not the responsibility of the processor; rather, recovery is accomplished at the system level. The only responsibility of the processor is to stop quickly, before any incorrect results can propagate to other modules. The lockstep/compare approach to processor error detection fits nicely with this fail-fast approach. In principle, when a divergence between the lockstep operation of the processors is detected, the processors could simply stop executing.
As integrated circuit technology has advanced, more and more circuitry can be put on an integrated chip. Thus, on-chip processors (microprocessors) are capable of being provided very large cache memories that bring with them the advantage of fewer main memory accesses. However, such cache memories are subject to soft (correctable) errors produced, for example, by alpha particle emissions and cosmic rays. Accordingly, it is common to find such caches protected by error correcting codes. Otherwise, the error rate of these on-chip memories would cause processor failures at a rate that is not tolerable, even by non-fault-tolerant system vendors. The error correcting codes allow the processor to recover from these soft (correctable) errors in much the same way as main-memory ECC has allowed most soft memory errors to be tolerated. However, this gives rise to a nasty side-effect in lockstepped designs: the detection of, and recovery from, a correctable cache error will usually cause a difference in the cycle-by-cycle behavior of the two processors (a divergence), because the soft error occurs in only one of the two devices.
One solution to this problem is to have the error correction logic always perform its corrections in-line (a.k.a. in "zero time"), but this approach can require extra circuitry in the access path, resulting in slower accesses even in the absence of an error. This approach, therefore, is often deemed unacceptable for high speed designs because of the associated performance penalty.
Another approach is to present any detection of divergence between the two processors to the software as an interrupt, and the processors keep running. The software determines whether the divergence is due to a recoverable soft error or to a "true" divergence due to a miscomputation by one of the processors. If the error is deemed recoverable, necessary state is saved to memory, the microprocessors are reset and brought back into lockstepped operation, the state is restored from memory, and computation resumes from the point of interrupt. If the error is deemed not recoverable, then the software just halts. An example of this approach can be seen in U.S. application Ser. No. 09/201,635, now U.S. Pat. No. 6,393,582, assigned to the assignee of the invention described and claimed herein. However, this approach requires the cycle-by-cycle operation of the processors to be halted, the error checked, and the system restarted if necessary. For processor systems incorporating very large cache memories, as are becoming available today, that continual halting for the many soft (correctable) errors to be expected can be unacceptable.
Soft errors encountered on cache accesses can be self-correcting with today's error correcting codes, as indicated, with no visible time loss. There is no divergence during the soft error recovery, and no reset is required to recover. However, it is good practice to log each occurring error (i.e., record the memory address at which the error occurred, and track how many times this memory address experiences errors) and to "scrub" the memory location. ("Scrubbing" a memory location is a read of the memory location, followed by writing back to the memory location the value just read therefrom, followed by another read. In this way the memory location experiencing an error is checked to see if the error was transitory, i.e., a soft and correctable error.) The procedure of scrubbing a correctable memory error that is encountered by one, but most likely not the other, of a pair of lockstep processors would cause them to diverge onto different code paths, resulting in a detection of divergence between them, and most likely causing them to halt.
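The read, write-back, and re-read sequence of scrubbing may be sketched as follows. This is a minimal illustration under stated assumptions: the EccMemory class, its soft and stuck sets, and the scrub function are hypothetical modeling devices, not part of the description above, and the error correcting code itself is not modeled (a read simply returns the corrected value together with a flag saying whether an error was seen).

```python
class EccMemory:
    """Toy memory whose reads report whether an error was corrected.

    All names here are hypothetical. 'soft' holds addresses with a
    transient upset that a rewrite clears; 'stuck' holds addresses
    whose error reappears even after being rewritten.
    """

    def __init__(self):
        self.cells = {}
        self.soft = set()
        self.stuck = set()

    def read(self, addr):
        # Return (corrected value, whether an error was seen).
        error_seen = addr in self.soft or addr in self.stuck
        return self.cells[addr], error_seen

    def write(self, addr, value):
        self.cells[addr] = value
        self.soft.discard(addr)  # rewriting clears a transient upset


def scrub(mem, addr):
    """Read, write back the value just read, then read again.

    Returns True when the second read is clean, i.e. the error
    was transitory.
    """
    value, _ = mem.read(addr)
    mem.write(addr, value)
    _, still_bad = mem.read(addr)
    return not still_bad
```

In this model a location in the soft set scrubs clean, while a location in the stuck set still reports an error on the final read.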
Thus, it can be seen that a way to provide lockstep processors with the ability to handle soft error logging and scrubbing without resorting to a reset operation or a divergence is needed.
The present invention provides a simple, effective technique for allowing lockstep processors to handle a correctable memory error in one of the lockstepped processors. The invention provides a simple method that allows the processors to exchange dissimilar information without diverging from the identical instruction streams they are executing.
Broadly, according to the present invention, a pair of lockstep processors, executing an identical instruction stream, will include conventional error-correcting circuitry that detects memory errors encountered when reading cache, corrects the error (if correctable), and logs each such correctable memory error to a status register, recording such information as the memory location at which the error occurred and how many times correctable errors are encountered over some set period of time. The address of each memory location at which an error is encountered is written to an error address register. At predetermined points in time, the lockstep processors will read the content of the status register, and write that content to an address identifying a first storage location of a storage unit external to the processors. However, the write address used by one of the processors is redirected (during the write operation) to a second storage location of the storage unit, resulting in the content of the status register of each of the lockstep processors being stored. Then, the processors read both of the storage locations just written sequentially. During the read operations, the address used by the other processor is not redirected. Thereby, the content of the status register of each of the lockstep processors has been provided to the other of the lockstep processors.
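The exchange just described may be sketched as follows. This is an illustration only, under stated assumptions: the addresses FIRST and SECOND, the ExchangeStorage class, the processor identifiers, and the exchange function are all hypothetical, and the redirection that would be performed by logic outside the processors is modeled as a simple address substitution on writes.

```python
FIRST = 0x100   # hypothetical address of the first storage location
SECOND = 0x104  # hypothetical address of the second storage location


class ExchangeStorage:
    """External storage unit with write redirection for one processor."""

    def __init__(self, redirected_cpu):
        self.cells = {}
        self.redirected_cpu = redirected_cpu

    def write(self, cpu_id, addr, value):
        # Both processors issue the same address; the write from one
        # of them is steered to the second location instead.
        if cpu_id == self.redirected_cpu and addr == FIRST:
            addr = SECOND
        self.cells[addr] = value

    def read(self, cpu_id, addr):
        # Reads are not redirected: both processors see the same cells.
        return self.cells[addr]


def exchange(storage, status_by_cpu):
    """Run the identical write-then-read sequence on every processor.

    Returns, per processor, the pair of values it read back.
    """
    for cpu_id, status in status_by_cpu.items():
        storage.write(cpu_id, FIRST, status)
    return {
        cpu_id: (storage.read(cpu_id, FIRST), storage.read(cpu_id, SECOND))
        for cpu_id in status_by_cpu
    }
```

With processor "B" chosen as the redirected one, statuses of 0 for "A" and 3 for "B" leave both processors holding the same pair (0, 3). Each has executed the same instructions and read the same data, so the dissimilar status values are shared without the instruction streams differing.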
In a further embodiment of the invention, the lockstep processors go through an identical code sequence to check and see if the status registers indicate that soft errors were encountered. If so, the lockstep processors go through the same procedure described above to exchange the contents of their respective error address registers, thereby providing each with the memory locations that have experienced correctable errors and need to be purged. The lockstep processors proceed to purge each such memory location, regardless of whether it is needed by the particular processor or not. A timer is then reset to establish the next error-recording period.
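The purge step of this further embodiment may be sketched as follows, assuming the error addresses have already been exchanged so that each processor holds the same pair of values. The function names, the use of None to mean "no error recorded", and the modeling of a purge as dropping the entry from a dictionary standing in for the cache are all assumptions made for illustration.

```python
def purge_plan(exchanged):
    """Derive the purge list from the exchanged error addresses.

    'exchanged' is the pair read back from the two storage
    locations. Both processors hold the same pair, so both derive
    the same list in the same order.
    """
    return [addr for addr in exchanged if addr is not None]


def purge_all(cache, exchanged):
    """Purge every listed location, whether or not this cache needs it."""
    plan = purge_plan(exchanged)
    for addr in plan:
        cache.pop(addr, None)  # a no-op if the entry is absent
    return plan
```

Because each processor walks the same list, the processor that saw no error still performs the purge, and the two code paths remain identical.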
There are a number of advantages achieved by the invention. Chief among them, lockstep processors are able to handle soft error recovery without diverging in code execution, and therefore without having to resort to a reset recovery.
These and other aspects and advantages of the present invention will become apparent to those skilled in this art upon a reading of the following description of the specific embodiments of the invention, which should be taken in conjunction with the accompanying drawings.