1. Field of the Invention
This invention relates to so-called lockstep processor systems, that is, systems in which at least two processors (commonly designated master and slave) independently process the same task in lockstep, and their independently generated results are compared in order to detect an error originating in one processor. More particularly, the invention relates to an improved lockstep processor system that provides partial fault isolation in addition to error detection.
2. Description of the Prior Art
In general, prior art lockstep processor systems are designed to detect a failure, and do not typically include hardware and/or software to isolate the source of the error or the processor (master or slave) where the error occurred.
More specifically, the outputs of lockstep processors are typically tied together. One processor is declared the master and is allowed to drive the outputs, while the other processor, the slave, is only allowed to receive on its outputs. The slave processor compares the master's outputs with its own internally generated outputs to ensure that they are the same. If the outputs do not match on a line-by-line basis, the whole system is stopped due to an error. This method has two problems. First, it cannot isolate the failure to either the master or the slave processor, since code checking is typically not used on these lines. Second, and as a consequence, the system cannot recover from a failure by operating in a degraded mode without stopping the processor.
A classical lockstep technique, implemented with respect to the central output bus, is shown in FIG. 1. Classical lockstep techniques are concerned with the control lines only during cycles in which the processors are transmitting data. A bit-by-bit compare is performed between the data that is internally generated by one processor (the slave processor) and the data that has been placed on the external lines by the other processor (the master processor). This comparison is valid only during cycles in which data is being sent from the master processor. The comparison logic is activated in the slave processor and, upon detecting any difference between the processor-generated outputs on corresponding lines, the slave processor stops the system and waits for some higher level to initiate isolation and recovery actions.
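The comparison described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the circuit of any particular prior art system: the names `Cycle` and `first_miscompare` are hypothetical, and each bus value is modeled as an integer whose bits stand for the individual lines.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    transmitting: bool  # True when the master is driving the lines this cycle
    master_lines: int   # value on the external lines, driven by the master
    slave_lines: int    # value the slave generated internally


def first_miscompare(cycles):
    """Return the index of the first transmit cycle whose lines differ, else None."""
    for index, cycle in enumerate(cycles):
        if not cycle.transmitting:
            continue  # the comparison is valid only while the master is sending
        if cycle.master_lines ^ cycle.slave_lines:
            # A nonzero XOR means at least one line differs. The system would
            # be stopped here; nothing identifies which processor is at fault.
            return index
    return None
```

Note that the result says only that a miscompare occurred and when. It carries no information about whether the master or the slave produced the bad value, which is the limitation discussed above.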
In lockstep processors, the I/O bus is typically compared only during the cycles that the processor is transmitting data to an I/O device. When the processors are receiving data from an I/O device, the lockstep processors rely on internal data checking to find errors or rely on the processors eventually getting out of sync as a result of the errors. Once out of sync, the processors are stopped since there is no method in the prior art lockstep processor system to determine which processor is at fault. One object of this invention is to determine whether the master or slave processor has failed, to recover from the failure by degrading error detection, and, in addition, to provide lockstep coverage when receiving data from an I/O device.
In the prior art, detection of errors on input lines to lockstep processors, after their synchronization, has essentially been ignored. An assumption has been that both processors saw the same error and would respond in the same manner. If only one processor saw an error, the processors would no longer be synchronized. This lack of synchronization would eventually be detected at an output; but the only recovery action would be to reset the system. Another object of this invention is to detect error inputs to dual lockstep processors, isolate error inputs to a single processor before it corrupts the system, and disable the processor with error inputs.
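The delayed detection described above can be illustrated with a small simulation. It is a sketch under assumptions that are not taken from any actual system: the function name `detection_cycle` is hypothetical, the state update is an arbitrary stand-in for real processing, and the corrupted input is assumed to reach only the slave.

```python
def detection_cycle(inputs, output_cycles, corrupt_cycle, flip_mask=0x1):
    """Return the cycle at which an output compare first exposes divergence.

    inputs        -- values presented to both processors, one per cycle
    output_cycles -- set of cycles on which outputs are compared
    corrupt_cycle -- cycle on which the slave alone sees a corrupted input
                     (None for no corruption)
    """
    master_state = 0
    slave_state = 0
    for cycle, value in enumerate(inputs):
        # Only the slave sees the corrupted input on the chosen cycle.
        slave_value = value ^ flip_mask if cycle == corrupt_cycle else value
        # Hypothetical state update standing in for real processing.
        master_state = (master_state * 31 + value) & 0xFFFF
        slave_state = (slave_state * 31 + slave_value) & 0xFFFF
        # Divergence is visible only on cycles where outputs are compared.
        if cycle in output_cycles and master_state != slave_state:
            return cycle
    return None
```

With inputs `[3, 1, 4, 1, 5, 9, 2, 6]`, output compares on cycles 3 and 7, and a corrupted input on cycle 1, the divergence is not exposed until cycle 3, by which time the corrupted state has already propagated inside one processor.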
Lines to a processor that are only inputs typically use either interrupt/error busses or control busses. In general such busses are protected from errors by at least one of the following four methods: parity checking, code checking, duplication, or protocol checking. A typical input bus of this type, with its corresponding checking, is shown in FIG. 1.
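Of the four methods listed, parity checking is the simplest to illustrate. The following is a minimal sketch assuming even parity over a bus value modeled as an integer; the function names `parity_bit` and `parity_ok` are hypothetical and do not describe any particular bus.

```python
def parity_bit(value: int) -> int:
    """Return the even-parity bit for value: 1 if it has an odd number of 1 bits."""
    return bin(value).count("1") & 1


def parity_ok(data: int, received_parity: int) -> bool:
    """Check received data against the parity bit that accompanied it."""
    return parity_bit(data) == received_parity
```

A check of this kind detects any single-bit error on the bus but misses errors that flip an even number of bits, which is one reason such busses may also rely on code checking, duplication, or protocol checking.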
In the classical dual lockstep processor configuration, one processor is the master, and it communicates with memory and I/O devices, while the other processor is the slave and it compares the outputs from the master to its internally generated signals to ensure that the master processor is synchronized with the slave. If the processors are not synchronized, the system is stopped.
Due to the large width of the memory address bus, a popular method of detecting any address failure is to perform lockstep comparison on the processor chips. The master processor and slave processor share the memory address lines. The master processor is allowed to drive the address, and the slave processor compares, on a bit-by-bit basis, what the master has driven with what the slave has generated internally. If an error is detected, the system is stopped, since a comparison check indicates only that there was a failure but not which processor failed. With this prior art method there is no isolation, and the system cannot automatically change to a degraded mode of operation. Attempting to isolate the error in prior art lockstep processor systems incurs a significant delay. In addition, the current method can isolate the failure only if the failure is detectable by the diagnostic routines, that is, only if the error is not transient.
In prior art lockstep processor systems, the memory data bus is compared only during the cycles in which the processor is writing to memory. When the processor is reading from memory, the lockstep processors rely on internal checking of the data to find errors or rely on the processors eventually getting out of sync. Once out of sync, the processors are stopped since there is no method to determine which processor is at fault. An object of this invention is to determine whether the master or slave processor has failed, to recover from the failure by degrading error detection, and to provide error coverage during subsequent memory reads.