1. Field of the Invention
The present invention relates to high reliability processing by hardware redundancy. More particularly, the invention relates to a processing system with pair-wise processors that operate in a high reliability mode to detect computational errors, and operate independently in a high performance mode.
2. Related Art
Various approaches exist for achieving high reliability processing. FIG. 1 illustrates one prior art processor 100 for high reliability processing. The processor 100 includes two execution units 130 and 135, which are both the same type of arithmetic unit. For example, the two execution units could both be floating point units, or integer units. The processor 100 has architected registers 120 for holding committed execution results. The two execution units 130 and 135 both execute the same instruction stream in parallel. That is, for each instruction an instance of the instruction executes in each respective execution unit 130 and 135. Then, when the two units are ready to commit the result for an instruction to the register file 120, the two versions of the result are compared by compare unit 125. If the compare unit 125 determines that the versions are the same, then the unit 125 updates one or more of the registers 120 with the result. If the versions do not match, then other actions are taken. In one implementation, a counter records whether an error is occurring repeatedly, and if it is, the error is classified as a "hard" failure. In the case of a hard failure, the instruction issue mechanism does not reissue the faulting instruction, but instead executes a "trap" instruction. One such trap leads to a micro code routine for reading out the state of the defective processor and loading it into a spare processor, which restarts execution at the instruction that originally faulted. In an alternative, where no spare processor is available, the trap leads to the operating system migrating the processes on the faulty processor to other processors, which adds to the workload of the other processors.
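The compare-before-commit behavior of FIG. 1 may be illustrated by the following minimal sketch. It is not taken from the prior art processor itself: the function name `commit_or_trap`, the dictionary standing in for the registers 120, and the retry limit of three are assumptions made only for illustration, since the description above gives no threshold for classifying a failure as hard.

```python
def commit_or_trap(unit_a, unit_b, operands, registers, dest, max_retries=3):
    """Run one instruction on both execution units and commit only on agreement.

    unit_a / unit_b are callables standing in for execution units 130 and 135.
    max_retries is an assumed threshold; the text does not specify one.
    """
    for _ in range(max_retries):
        result_a = unit_a(*operands)
        result_b = unit_b(*operands)
        if result_a == result_b:
            # Versions match: the compare unit updates the register.
            registers[dest] = result_a
            return "committed"
        # Versions differ: reissue the instruction and try again.
    # The mismatch kept recurring, so it is treated as a hard failure.
    return "trap"


if __name__ == "__main__":
    regs = {}
    good = lambda x, y: x + y
    stuck = lambda x, y: x + y + 1  # models a unit that always errs
    print(commit_or_trap(good, good, (2, 3), regs, "r1"), regs)
    print(commit_or_trap(good, stuck, (2, 3), regs, "r2"), regs)
```

In this sketch nothing is written to the destination register unless both versions agree, which corresponds to the role of the compare unit 125 described above.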
While this arrangement provides a reliability advantage, it is disadvantageous in that the processor design is more complex than a conventional processor and has greater overhead. Moreover, having both execution units 130 and 135 in the processor 100 execute the same instruction stream limits the throughput of the processor 100. Another variation of a processor which is designed for exclusively high reliability operation is shown in Richard N. Gufstason, John S. Liptay, and Charles F. Webb, "Data Processor with Enhanced Error Recovery," U.S. Pat. No. 5,504,859, issued Apr. 2, 1996.
FIG. 2 illustrates another arrangement for high reliability processing. In this voting arrangement, three processors 200 each execute the same program in parallel, and versions of a result are compared at checkpoints in the program on a bus 160 external to the processors 200. If the versions do not match, then other actions are taken, such as substituting a different processor 200 for the one that produced the disparate version. This arrangement is advantageous in that complexity of the individual processors 200 is reduced, and an error producing processor can be identified. Also, the throughput of one of the processors 200 may be greater than that of the one processor 100 in FIG. 1, since the individual processor 200 does not devote any of its execution units to redundant processing. However, the arrangement of FIG. 2 is redundant at the level of the processors 200, and uses three whole processors 200 to recover from a single fault. Also, the error checking is limited to results which are asserted externally by the processors.
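The voting at a checkpoint in an arrangement such as that of FIG. 2 may be sketched as follows. The function name `vote` and the use of a simple majority are assumptions for illustration; the description above states only that versions are compared and that the processor producing the disparate version may be replaced.

```python
from collections import Counter


def vote(results):
    """Return (majority_value, indices of processors that disagree with it).

    results holds one checkpoint value per processor. Raises RuntimeError
    when no value is held by a strict majority.
    """
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among processor results")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty


if __name__ == "__main__":
    # The third processor produced a disparate version at this checkpoint.
    print(vote([7, 7, 9]))
```

With three processors, a single disparate version is both outvoted and attributed to a particular processor, which is why this arrangement can identify the error producing processor.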
In the related application, a pair of processors use state-of-the-art state recovery mechanisms that are already available for recovering from exceptions and apply these mechanisms to operate in lockstep synchrony in a high reliability mode. This is highly advantageous because it achieves the high reliability without extensive modification to existing processor design. However, it is somewhat limiting because of the required synchrony. That is, in the high reliability mode the processors in the related application must process a stream of instructions in the same sequence.
From the foregoing, it may be seen that a need exists for improvements in high reliability processing.
The foregoing need is addressed in the present invention. According to the invention, in a first embodiment, a multiprocessing system includes a first processor, a second processor, and compare logic. The first processor is operable to compute first results responsive to instructions, the second processor is operable to compute second results responsive to the instructions, and the compare logic is operable to check at checkpoints for matching of the results. Each of the processors has a first register for storing one of the processor's results, and the register has a stack of shadow registers. The processor is operable to shift a current one of the processor's results from the first register into the top shadow register, so that an earlier one of the processor's results can be restored from one of the shadow registers to the first register responsive to the compare logic determining that the first and second results mismatch. It is advantageous that the shadow register stack is closely coupled to its corresponding register, which provides for fast restoration of results.
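A register with a stack of shadow registers, as described above, may be modeled in software along the following lines. This is a sketch under stated assumptions rather than the disclosed hardware: the class name `ShadowedRegister`, the stack depth, and the choice to shift on every write are hypothetical, since this section does not state when the shift occurs or how deep the stack is.

```python
class ShadowedRegister:
    """A register whose earlier values are kept in a bounded shadow stack."""

    def __init__(self, depth=4, initial=0):
        self.value = initial
        self.depth = depth      # assumed number of shadow registers
        self.shadows = []       # shadows[0] is the top shadow register

    def shift_and_write(self, new_value):
        """Shift the current result into the top shadow, then store a new one."""
        self.shadows.insert(0, self.value)
        del self.shadows[self.depth:]   # the oldest shadow falls off the stack
        self.value = new_value

    def restore(self, levels=1):
        """Restore the result saved `levels` shifts ago (used on a mismatch)."""
        if not 1 <= levels <= len(self.shadows):
            raise ValueError("no shadow register at that depth")
        self.value = self.shadows[levels - 1]
        del self.shadows[:levels]       # discard the rolled-back entries


if __name__ == "__main__":
    reg = ShadowedRegister(depth=2)
    for result in (10, 20, 30):
        reg.shift_and_write(result)
    reg.restore(levels=2)   # compare logic reported a mismatch
    print(reg.value)
```

Because the earlier results sit in a stack attached to the register itself, restoration in this model is a single copy from a shadow entry, which mirrors the close coupling noted above.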
In a further aspect of an embodiment, each processor has a signature generator and a signature storage unit. The signature generator and storage unit are operable to cooperatively compute a cumulative signature for a sequence of the processor's results, and the processor is operable to store the cumulative signature in the signature storage unit pending the match or mismatch determination by the compare logic. The checking for matching of the results includes the compare logic comparing the cumulative signatures of each respective processor. It is faster, and therefore advantageous, to check respective cumulative signatures at intervals rather than to check each individual result.
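The cumulative signature may be illustrated by the following sketch. The signature function is an assumption: a running CRC-32 from the Python standard library is used here only because this section does not specify how the signature is computed. The class name `SignatureUnit` and the 8-byte encoding of each result are likewise hypothetical.

```python
import zlib


class SignatureUnit:
    """Folds a sequence of results into one stored, cumulative signature."""

    def __init__(self):
        self.signature = 0   # plays the role of the signature storage unit

    def accumulate(self, result):
        # Encode the result as 8 bytes and fold it into the running CRC-32.
        data = result.to_bytes(8, "little", signed=True)
        self.signature = zlib.crc32(data, self.signature)


def signatures_match(unit_a, unit_b):
    """Compare logic: one comparison covers the whole sequence of results."""
    return unit_a.signature == unit_b.signature


if __name__ == "__main__":
    first, second = SignatureUnit(), SignatureUnit()
    for r in (1, 2, 3):
        first.accumulate(r)
    for r in (1, 2, 4):   # the second processor erred on its last result
        second.accumulate(r)
    print(signatures_match(first, second))
```

A single comparison at the checkpoint thus stands in for one comparison per result, which is the speed advantage described above.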
Also, in one embodiment, the instructions have a certain instruction sequence and at least one of the processors may execute instructions in a sequence different from the certain instruction sequence, but both of the processors execute store-type instructions according to a sequence in which the store-type instructions occur in the certain instruction sequence. The checkpoints are responsive to store instructions, so that a first sequence of results for the first processor ends at one of the checkpoints with a result for one of the store instructions and a second sequence of results ends at the checkpoint for the second processor with a result for the same one of the store instructions. It is advantageous to trigger checkpoints responsive to store-type instructions so that while an intermediate one of the results of the first sequence of results may be different than a corresponding intermediate one of the results of the second sequence of results, nevertheless the first processor's ending result for the first sequence and the second processor's ending result for the second sequence tend to match unless one of the processors has malfunctioned.
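The effect of placing checkpoints at store instructions may be seen in the following sketch, in which two processors execute the same instructions in different orders. The instruction encoding, the function name `run_to_store`, and the example program are assumptions introduced only for illustration.

```python
def run_to_store(instructions, order, store_source):
    """Execute instructions in the given order up to the closing store.

    Each instruction is (dest, fn, sources). Returns the trace of
    intermediate results and the value the store would write.
    """
    regs, trace = {}, []
    for index in order:
        dest, fn, sources = instructions[index]
        regs[dest] = fn(*(regs[s] for s in sources))
        trace.append(regs[dest])
    return trace, regs[store_source]


if __name__ == "__main__":
    program = [
        ("r1", lambda: 2, ()),
        ("r2", lambda: 5, ()),
        ("r3", lambda a, b: a * b, ("r1", "r2")),
    ]
    # First processor follows program order; second swaps the two
    # independent loads. Both end with a store of r3.
    trace_1, stored_1 = run_to_store(program, [0, 1, 2], "r3")
    trace_2, stored_2 = run_to_store(program, [1, 0, 2], "r3")
    print(trace_1, trace_2, stored_1 == stored_2)
```

The intermediate results appear in different positions in the two traces, yet the value presented at the store is the same, so a comparison made at that point does not report a false mismatch.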
In an alternative embodiment, the second processor executes the instructions in a sequence identical to a sequence in which the first processor executes the instructions, and the checkpoints are responsive to an accumulated number of execution cycles. In this embodiment the checkpoints may also be responsive to store instructions. In one such embodiment, the checkpoints are responsive to store instructions, and to the accumulated number of execution cycles if there has been no store instruction since a last checkpoint.
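The combined checkpoint policy of the last-mentioned embodiment may be sketched as follows. The function name `checkpoint_indices`, the per-instruction cycle counts, and the cycle limit are hypothetical values chosen for illustration.

```python
def checkpoint_indices(events, max_cycles):
    """Return the instruction indices at which a checkpoint is taken.

    events is a sequence of (cycles, is_store) pairs, one per instruction.
    A checkpoint is taken at every store, and also whenever max_cycles
    have accumulated with no store since the last checkpoint.
    """
    checkpoints = []
    elapsed = 0
    for index, (cycles, is_store) in enumerate(events):
        elapsed += cycles
        if is_store or elapsed >= max_cycles:
            checkpoints.append(index)
            elapsed = 0   # the cycle count restarts after each checkpoint
    return checkpoints


if __name__ == "__main__":
    events = [(1, False), (1, True), (2, False),
              (2, False), (1, False), (1, True)]
    print(checkpoint_indices(events, max_cycles=4))
```

The cycle limit bounds the amount of work between comparisons when a long run of instructions contains no store.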
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will be described herein in detail. It should be understood, however, that the drawings and detailed description are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.