1. Field of the Invention
The present invention relates to high reliability processing by hardware redundancy. More particularly, the invention relates to a processing system with pair-wise processors that operate in a high reliability mode to detect computational errors, and operate independently in a high performance mode.
2. Description of the Related Art
Various approaches exist for achieving high reliability processing. FIG. 1 illustrates one prior art processor 100 for high reliability processing. The processor 100 includes two execution units 130 and 135, which are both the same type of arithmetic unit. For example, the two execution units could both be floating point units, or integer units. The processor 100 has architected registers 120 for holding committed execution results. The two execution units 130 and 135 both execute the same instruction stream in parallel. That is, for each instruction an instance of the instruction executes in each respective execution unit 130 and 135. Then, when the two units are ready to commit the result for an instruction to the register file 120, the two versions of the result are compared by compare unit 125. If the compare unit 125 determines that the versions are the same, then the unit 125 updates one or more of the registers 120 with the result. If the versions do not match, then other actions are taken. In one implementation, a counter records whether an error is occurring repeatedly, and if it is, the error is classified as a “hard” failure. In the case of a hard failure, the instruction issue mechanism does not reissue the faulting instruction, but instead executes a “trap” instruction. One such trap leads to a micro code routine for reading out the state of the defective processor and loading it into a spare processor, which restarts execution at the instruction that originally faulted. In an alternative, where no spare processor is available, the trap leads to the operating system migrating the processes on the faulty processor to other processors, which adds to the workload of the other processors.
While this arrangement provides a reliability advantage, it is disadvantageous in that the processor design is more complex than a conventional processor and has greater overhead. Moreover, having the two execution units 130 and 135 both execute the same instruction stream limits the throughput of the processor 100. Another variation of a processor which is designed exclusively for high reliability operation is shown in Richard N. Gustafson, John S. Liptay, and Charles F. Webb, “Data Processor with Enhanced Error Recovery,” U.S. Pat. No. 5,504,859, issued Apr. 2, 1996.
FIG. 2 illustrates another arrangement for high reliability processing. In this voting arrangement, three processors 200 each execute the same program in parallel, and versions of a result are compared at checkpoints in the program on a bus 160 external to the processors 200. If the versions do not match, then other actions are taken, such as substituting a different processor 200 for the one that produced the disparate version. This arrangement is advantageous in that the complexity of the individual processors 200 is reduced, and an error producing processor can be identified. Also, the throughput of one of the processors 200 may be greater than that of the one processor 100 in FIG. 1, since the individual processor 200 does not devote any of its execution units to redundant processing. However, the arrangement of FIG. 2 is redundant at the level of the processors 200, and uses three whole processors 200 to recover from a single fault. Also, the error checking is limited to results which are asserted externally by the processors.
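The checkpoint comparison of the voting arrangement can be illustrated with a minimal sketch. It assumes a strict-majority rule and a hypothetical function name `vote`; neither is taken from FIG. 2, which states only that versions are compared and that the processor producing the disparate version may be replaced.

```python
from collections import Counter

def vote(results):
    """Return (majority_value, indices_of_disagreeing_processors).

    `results` holds one checkpoint value per processor (hypothetical
    representation). Raises ValueError when no value is held by a
    strict majority, since the faulty processor cannot then be isolated.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority; fault cannot be isolated")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty

# Three processors report a checkpoint value; the third disagrees.
majority, suspects = vote([7, 7, 9])
```

With three processors a single disparate version is outvoted and identified, which is why this arrangement needs three whole processors to recover from one fault.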
From the foregoing, it may be seen that a need exists for improvements in high reliability processing.
The foregoing need is addressed in the present invention. According to the invention, in a first embodiment, a multiprocessing system includes a first processor and a second processor. Each of the processors has its own data and instruction caches to support independent operation. In a first mode, a “high performance” mode, the processors independently execute separate instruction streams. In a second mode, a “high reliability” mode, both processors execute the same instruction stream. That is, for an instruction in the stream each processor computes its own version of a result.
The system includes a compare unit for indicating whether the respective versions match. If the versions do not match for an instruction, the instruction is deemed to be a faulting instruction. Responsive to the system being in the high reliability mode and the compare unit indicating a faulting instruction, the processors recover a state that the processors had prior to execution of the faulting instruction, and the processors re-execute the faulting instruction.
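The compare, recover, and re-execute behavior described above can be sketched in software as follows. This is only an illustration under stated assumptions: the toy instruction format `(dest, src, immediate)`, the names `execute` and `run_lockstep`, and the retry bound `max_retries` are hypothetical and are not part of the described system.

```python
def execute(instr, state):
    """Hypothetical toy instruction: dest = src + immediate."""
    dest, src, imm = instr
    new_state = dict(state)          # leave the prior state untouched
    new_state[dest] = new_state.get(src, 0) + imm
    return new_state

def run_lockstep(program, state, core_a=execute, core_b=execute, max_retries=3):
    """Run one instruction stream on two cores, committing only on a match.

    `state` is overwritten only after both versions agree, so a mismatch
    leaves the state that existed prior to the faulting instruction, and
    the instruction is then re-executed. Returns (final_state, retries).
    """
    retries = 0
    for instr in program:
        for _ in range(max_retries + 1):
            version_a = core_a(instr, state)
            version_b = core_b(instr, state)
            if version_a == version_b:   # compare unit: versions match
                state = version_a        # commit the result
                break
            retries += 1                 # faulting instruction: retry
        else:
            raise RuntimeError(f"persistent mismatch on {instr!r}")
    return state, retries
```

A transient error in one core costs one re-execution, while a mismatch that persists past the assumed retry bound is reported rather than retried indefinitely.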
In an embodiment, each of the processors has a respective signature generator. Each of the signature generators is coupled to the compare unit. Responsive to the respective versions, the signature generators assert signatures to the compare unit, so that a faulting instruction may be detected.
In another aspect, each processor has its own respective commit logic. If the compare unit receives matching signatures for corresponding versions of a result, the compare unit signals the commit logic in each respective processor that the possibility has been eliminated of a calculation interrupt arising for that instruction. This permits the commit logic to commit the result. If the signatures do not match, the compare unit signals the commit logic that the corresponding instruction has faulted. In response, the commit logic permits instructions prior to the faulting instruction in program order to continue execution, but flushes instructions, and their results, that follow the faulting instruction in program sequence. Alternatively, the commit logic flushes those results that were produced by the faulting instruction, and only selected instruction results subsequent in program order to the faulting instruction, that is, those instructions and their results dependent on the faulting instruction.
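The two flush alternatives can be contrasted with a short sketch. It assumes a hypothetical in-flight window represented as a list of `(dest, sources)` register tuples in program order, and it treats the faulting instruction itself as flushed because it is re-executed; the function names are illustrative only.

```python
def flush_all_younger(window, fault_idx):
    """First alternative: flush the faulting instruction and every
    instruction after it in program order. Returns flushed indices."""
    return list(range(fault_idx, len(window)))

def flush_dependents(window, fault_idx):
    """Second alternative: flush the faulting instruction and only those
    later instructions that depend on it, directly or transitively."""
    tainted = {window[fault_idx][0]}      # registers holding suspect values
    flushed = [fault_idx]
    for i in range(fault_idx + 1, len(window)):
        dest, sources = window[i]
        if tainted & set(sources):
            tainted.add(dest)             # result derived from a suspect value
            flushed.append(i)
        else:
            tainted.discard(dest)         # register overwritten by a clean result
    return flushed

# Hypothetical window; instruction 1 (writing r2) is the faulting one.
window = [
    ("r1", ("r0",)),
    ("r2", ("r1",)),
    ("r3", ("r0",)),
    ("r4", ("r2", "r3")),
    ("r2", ("r0",)),
    ("r5", ("r2",)),
]
```

In this example the first alternative discards instructions 1 through 5, whereas the dependency-based alternative discards only instructions 1 and 3, at the cost of tracking which results derive from the faulting instruction.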
In still another aspect, in one embodiment such a signature includes a bit indicating parity for the signature's corresponding version of the result. For one such embodiment, the signature consists of a single parity bit. In an alternative, the signature includes a number of parity bits for respective subsets of its version. In another embodiment, the signature includes a sum for all the bits of its version of the result. In another embodiment, the signature includes the entire version itself.
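The signature alternatives trade signature width against detection coverage, as the following sketch suggests. It assumes results are non-negative integers, that the "respective subsets" are 8-bit groups of a 64-bit result, and that the "sum for all the bits" is a count of set bits; these choices and all function names are assumptions made for illustration.

```python
def parity_bit(value):
    """Single parity bit over the whole result."""
    return bin(value).count("1") & 1

def subset_parity(value, width=64, chunk=8):
    """One parity bit per `chunk`-bit subset of a `width`-bit result."""
    mask = (1 << chunk) - 1
    return tuple(parity_bit((value >> shift) & mask)
                 for shift in range(0, width, chunk))

def bit_sum(value):
    """Sum of all bits of the result (assumed here to mean the number of set bits)."""
    return bin(value).count("1")

def whole_value(value):
    """Signature consisting of the entire result itself."""
    return value

# Two versions that differ in two bits of the same byte.
a, b = 0b1011, 0b1000
single_parity_detects = parity_bit(a) != parity_bit(b)
subset_parity_detects = subset_parity(a) != subset_parity(b)
bit_sum_detects = bit_sum(a) != bit_sum(b)
```

A double-bit difference within one subset escapes both parity schemes in this example, while the bit sum and the full-value signature still expose it; wider signatures detect more mismatches but require more wiring to the compare unit.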
In another aspect, the system includes complete logic for generating an error correction code for including as part of the processor state with an instruction result. For such an instruction result, the signature generators produce their respective signatures in response to their respective result versions, including the error correction codes for the versions.
In a still further aspect, in the high performance mode, in which the processors execute separate programs or instruction streams, each processor will have independent bus accesses through its own respective bus logic. For this circumstance, mode control logic notifies arbitration logic in the bus interface unit to arbitrate between the independent bus requests of the two bus logic units.
In the high reliability mode, in which the two processors both execute the same program or instruction stream in parallel, each processor will need identical, lockstep bus accesses. For this circumstance, mode control logic notifies arbitration logic in the bus interface unit to allow only one of the bus logic units to control bus requests and read the bus for both processors in the system.
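The mode-dependent behavior of the arbitration logic can be modeled roughly as below. The class and method names are hypothetical, requests are modeled simply as addresses, and round-robin is assumed as the arbitration policy for the high performance mode; the description itself does not specify a policy.

```python
from enum import Enum

class Mode(Enum):
    HIGH_PERFORMANCE = 1
    HIGH_RELIABILITY = 2

class BusInterfaceUnit:
    """Toy model of arbitration between two bus logic units (0 and 1)."""

    def __init__(self, mode=Mode.HIGH_PERFORMANCE):
        self.mode = mode
        self._last = 1    # so that unit 0 wins the first contested cycle

    def arbitrate(self, req0, req1):
        """Return (address driven on the bus, units receiving the read data),
        or None when there is nothing to grant. A request of None means idle."""
        if self.mode is Mode.HIGH_RELIABILITY:
            # Only unit 0 controls bus requests; both processors read the bus.
            if req0 is None:
                return None
            return req0, (0, 1)
        pending = [u for u, r in ((0, req0), (1, req1)) if r is not None]
        if not pending:
            return None
        # Assumed policy: round-robin between the independent requests.
        winner = pending[0] if len(pending) == 1 else 1 - self._last
        self._last = winner
        return (req0, req1)[winner], (winner,)
```

Switching `mode` corresponds to the mode control logic notifying the arbitration logic: in the high performance mode each grant serves one processor, and in the high reliability mode a single grant feeds both processors so that their bus accesses stay in lockstep.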
In a further aspect, since the processors are subject to external interrupts, which can disturb synchrony unless coordinated properly, the bus interface unit for the system has common external interrupt logic which responds to external interrupt requests and signals both processors simultaneously to respond to the interrupt request. The response may include merely setting a bit in a register for later follow up, or it may include causing the processor to branch to a micro code routine, execute a trap instruction calling an operating system routine, or even terminate dual execution of an instruction stream, so that the processors terminate in synchrony.
In another embodiment, a method for multiprocessor operation includes a step of selecting an operating mode. Responsive to being in a high performance mode, two processors independently execute separate instruction streams. Responsive to being in a high reliability mode, the two processors concurrently execute instructions of one instruction stream, wherein each of the processors computes a respective version of a result for an instruction in the stream.
In a further aspect, responsive to the respective versions of an instruction result, signature generators assert signatures to a compare unit, so that a faulting instruction may be detected. Responsive to the system being in the high reliability mode and the compare unit indicating a faulting instruction, the processors recover a state that the processors had prior to execution of the faulting instruction, and the processors re-execute the faulting instruction. Responsive to the system being in the high reliability mode and the compare unit indicating a correctly calculated instruction, commit logic for each respective processor commits the result in each processor.
In another aspect, the respective versions of an instruction result include an error correction code, and in a method step the signature generators produce their respective signatures in response to their respective result versions, including the error correction codes for the versions.
In another aspect, in the high performance mode, in which the processors execute separate programs or instruction streams, each processor will have independent bus accesses through its own respective bus logic. For this circumstance, a method embodiment includes a step of mode control logic notifying arbitration logic in the bus interface unit to arbitrate between the independent bus requests of the two bus logic units. In the high reliability mode, in which the two processors both execute the same program or instruction stream in parallel, each processor will need identical, lockstep bus accesses. For this circumstance, the method embodiment includes a step of mode control logic notifying arbitration logic in the bus interface unit to allow only one of the bus logic units to control bus requests and read the bus for both processors in the system.
In a further aspect, a method embodiment includes the step of a bus interface unit for the system synchronously signaling both processors to respond to an external interrupt request.