The present invention relates generally to fault-tolerant data processing architecture, and more particularly to a logical processor formed from at least two processor units that execute identical instruction streams in close synchrony.
Among the important aspects of fault-tolerant architecture are (1) the ability to tolerate a failure of a component and continue operating, and (2) the ability to maintain data integrity in the face of a fault or failure. The first aspect is often addressed by employing redundant circuit paths in a system so that a failure of one path does not halt operation of the system. Another approach, which may be applicable to the second aspect, is to use self-checking circuitry (one example of which is the “duplicate and compare” technique). The self-checking approach uses substantially identical modules (e.g., processor units) that receive the same inputs and should therefore produce the same outputs; those outputs are compared. If the comparison detects a mismatch, both modules are halted in order to prevent the spread of possibly corrupt data. Examples of self-checking may be found in U.S. Pat. Nos. 4,176,258, 4,723,245, 4,541,094, and 4,843,608.
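The duplicate and compare technique described above can be illustrated with a minimal sketch. It is not taken from any of the cited patents: the names `duplicate_and_compare` and `DivergenceError` are hypothetical, and the two "modules" are modeled as plain Python callables rather than hardware units.

```python
from typing import Callable, List, Sequence


class DivergenceError(Exception):
    """Hypothetical exception: the two module outputs disagreed."""


def duplicate_and_compare(
    module_a: Callable[[int], int],
    module_b: Callable[[int], int],
    inputs: Sequence[int],
) -> List[int]:
    """Feed the same inputs to two nominally identical modules.

    Each pair of outputs is compared. On the first mismatch, processing
    stops (both modules are "halted") so that possibly corrupt data is
    not passed on.
    """
    outputs: List[int] = []
    for value in inputs:
        out_a = module_a(value)
        out_b = module_b(value)
        if out_a != out_b:
            raise DivergenceError(
                f"mismatch for input {value!r}: {out_a!r} != {out_b!r}"
            )
        outputs.append(out_a)
    return outputs
```

With two healthy modules the call returns the common outputs; if one module produces a different result for any input, the call raises before that result is released.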
One problem with the self-checking approach, when used for fault-tolerant processor design using paired processors in a duplicate and compare configuration, is that certain so-called “soft” errors (e.g., a cache error seen by one of the paired processors but not the other) require both processors to be halted and restarted. Thus, the detection of the fault, and recovery from that fault, is not necessarily transparent to the user, and even if transparent, the recovery process (typically involving a halt and reboot operation) can take a relatively long time. This problem is exacerbated by the recent use of larger and larger cache memory, both internal and external. Cache errors, however, are only one type of processor error from which recovery may be attempted. Processor designs using translation lookaside buffers with entry checking, parity checking, bus protocol checking, and the like can have one processor see an error that the other does not when the duplicate and compare technique is used.
Thus, a technique is needed for recovering smoothly and quickly from divergence between paired self-checking processor modules when that divergence results from an error detected by one module and not the other.
According to the present invention, a logical processor is formed from a pair of processor units and a single memory. Both processor units execute identical instruction streams, instruction by instruction, in close synchrony. However, only the output of one of the processor units (the “Master” processor unit) is used; the output of the other processor unit (the “Shadow” processor unit) is compared to that of the Master processor unit for self-checking. If a divergence is detected, the Master processor unit determines the cause of the divergence. If the Master processor unit determines that the divergence resulted from an error from which recovery is possible, it saves its present processor state to memory and causes a reset operation to be initiated that resets both the Master and Shadow processor units and reinitializes them using the saved state. Both processor units thereby recover quickly and smoothly from the detected error and resume operation from about the point at which the error was encountered.
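The recovery sequence summarized above (detect divergence, classify it, save state, reset both units, reinitialize from the saved state) can be sketched in software. This is a simplified model under stated assumptions, not the patented hardware: `ProcessorUnit`, `run_logical_processor`, and the `pending_soft_error_pc` fault-injection field are hypothetical, the "instruction stream" is a list of integers added to an accumulator, and the sketch assumes a transient error is seen by at most one unit at a time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ProcessorUnit:
    pc: int = 0                                   # program counter
    acc: int = 0                                  # accumulator
    pending_soft_error_pc: Optional[int] = None   # hypothetical fault injection
    soft_error_seen: bool = False                 # error report the Master can read

    def propose(self, program: List[int]) -> Tuple[int, int, Optional[int]]:
        """Compute (next_pc, next_acc, output) without committing it."""
        if self.pending_soft_error_pc == self.pc:
            self.pending_soft_error_pc = None     # the error is transient
            self.soft_error_seen = True
            return (self.pc, self.acc, None)      # this unit stalls, so outputs differ
        acc = self.acc + program[self.pc]
        return (self.pc + 1, acc, acc)

    def reset(self) -> None:
        self.pc, self.acc, self.soft_error_seen = 0, 0, False

    def load_state(self, state: Tuple[int, int]) -> None:
        self.pc, self.acc = state


def run_logical_processor(
    program: List[int], master: ProcessorUnit, shadow: ProcessorUnit
) -> Tuple[List[int], int]:
    """Run both units in lockstep; return (Master outputs, recovery count)."""
    memory: Dict[str, Tuple[int, int]] = {}       # the single shared memory
    outputs: List[int] = []
    recoveries = 0
    while master.pc < len(program):
        m = master.propose(program)
        s = shadow.propose(program)
        if m == s:
            # Outputs agree: commit both units and use only the Master's output.
            master.pc, master.acc = m[0], m[1]
            shadow.pc, shadow.acc = s[0], s[1]
            outputs.append(m[2])
            continue
        # Divergence: the Master decides whether recovery is possible.
        if not (master.soft_error_seen or shadow.soft_error_seen):
            raise RuntimeError("unrecoverable divergence: halt both units")
        memory["saved_state"] = (master.pc, master.acc)   # save Master state
        master.reset()                                    # reset both units
        shadow.reset()
        master.load_state(memory["saved_state"])          # reinitialize both
        shadow.load_state(memory["saved_state"])
        recoveries += 1
    return outputs, recoveries
```

In this model the comparison happens before either unit commits the instruction, so the saved state is the Master's state just before the faulting instruction and the instruction is simply retried after the reset. That ordering is a design choice of the sketch; the text above only requires that operation resume from about the point at which the error was encountered.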
In a further embodiment of the invention, when divergence between the two processor units is detected, output data transmission from the logical processor is immediately, but temporarily, suspended in order to prevent the spread of possibly corrupt data through the larger system that may be incorporating the logical processor. When the Master processor determines that recovery is possible, data transmission is resumed.
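The suspend-and-resume behavior of this embodiment can be sketched as a small gate placed in front of the outbound link. The class `OutputGate` and its method names are hypothetical. The sketch also assumes that data produced while transmission is suspended is held and released on resumption; the text does not say whether such data is held or discarded.

```python
from collections import deque
from typing import Deque, List


class OutputGate:
    """Hypothetical gate between the logical processor and the larger system."""

    def __init__(self, link: List[str]) -> None:
        self._link = link                  # a list standing in for the outbound link
        self._held: Deque[str] = deque()   # assumption: suspended output is held
        self.suspended = False

    def send(self, packet: str) -> None:
        if self.suspended:
            self._held.append(packet)
        else:
            self._link.append(packet)

    def on_divergence(self) -> None:
        """Suspend transmission immediately when divergence is detected."""
        self.suspended = True

    def on_recovery_decision(self, recoverable: bool) -> None:
        """Resume transmission only if the Master decides recovery is possible."""
        if not recoverable:
            return                         # stay suspended; nothing leaves
        self.suspended = False
        while self._held:
            self._link.append(self._held.popleft())
```

If the Master decides recovery is not possible, the gate stays closed, so nothing produced after the divergence reaches the rest of the system.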
In a still further embodiment of the invention, a timer is periodically preset with a predetermined value and allowed to count down (or up). If the timer is allowed to time out (i.e., reach another predetermined value) before being preset again, the logical processor will be subjected to a hard reset and reboot operation. This feature operates to preclude the logical processor from entering a loop of error detection and recovery (or any other loop) from which it cannot escape.
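The timer described above behaves like a conventional watchdog. The following sketch models the count-down variant with explicit ticks instead of a hardware clock; the class `WatchdogTimer`, its `preset` and `tick` methods, and the `on_timeout` callback are hypothetical names chosen for illustration.

```python
from typing import Callable


class WatchdogTimer:
    """Hypothetical count-down watchdog driven by explicit ticks."""

    def __init__(self, preset_value: int, on_timeout: Callable[[], None]) -> None:
        self._preset_value = preset_value
        self._on_timeout = on_timeout      # e.g. trigger a hard reset and reboot
        self._count = preset_value
        self.expired = False

    def preset(self) -> None:
        """Called periodically by healthy software to restart the count."""
        self._count = self._preset_value

    def tick(self) -> None:
        """Advance the timer by one step; fire once when it reaches zero."""
        if self.expired:
            return
        self._count -= 1
        if self._count <= 0:
            self.expired = True
            self._on_timeout()
```

As long as `preset` is called often enough, the callback never fires. A processor stuck in a loop of error detection and recovery would stop calling `preset`, and the timeout would then force the hard reset.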
A number of advantages flow from the present invention. The reset and reinitialization process in the face of a divergence, i.e., saving state to memory, resetting the processor units, and restarting them from the saved processor state, is much quicker than the prior approach of halting the processors and reloading them, in which processor state (and the processes therein) is lost, backup takeovers occur, and persistent processes are restarted. The present invention provides a substantially transparent recovery from many of the “soft” failures that can be encountered by self-checking processors.
Additionally, the present invention may be implemented using commercially-available microprocessors as long as the source of an error can be determined to be due to one microprocessor or the other.
These and other aspects and advantages of the invention will become apparent to those skilled in the art upon reading the following detailed description of the invention, which should be taken in conjunction with the accompanying drawings.