At least two types of computing faults may be of concern to computer system designers. A first fault may be a hardware failure, such as the failure of a processor or an unrecoverable memory error. A second fault may be a computational fault, such as may be caused by cosmic radiation changing the state of a bit in hardware. In order for the computer system to remain operational following a hardware failure, or to detect and recover from a computational fault, some computing systems have multiple processors executing the same software applications. In the event one of the processors experiences a hardware failure, computing continues with the one or more processors still functioning properly. Comparison of the outputs of the multiple processors may allow detection and correction of computational faults.
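The output comparison described above may be illustrated with a minimal majority-voting sketch. The function name `vote` and its return shape are hypothetical and chosen only for illustration; the text does not specify how the comparison is performed.

```python
from collections import Counter


def vote(outputs):
    """Compare the outputs of redundant processors (hypothetical helper).

    Returns (majority_value, faulty_indices), where faulty_indices lists the
    processors whose output disagrees with the majority.
    Raises ValueError when no value is held by a strict majority, in which
    case a fault is detected but cannot be corrected by voting.
    """
    if not outputs:
        raise ValueError("no outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: fault detected but not correctable")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

Under this sketch, three processors may allow both detection and correction of a single computational fault, while two disagreeing processors allow detection only, since neither output can be identified as the correct one.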
In some cases the processors executing the same software application operate in cycle-by-cycle or strict lock-step, with each processor provided a duplicate clocking signal and executing the same software code cycle-by-cycle. While processor clocking frequencies have increased, so too have die sizes. Increased clock frequency, in combination with larger die sizes, makes it difficult to control phase differences in the clocking signals of computer systems, and therefore also difficult to implement strict lock-step. Further difficulties may include handling of recoverable errors (soft errors) that occur in one processor but not in the others. To address these difficulties, some computer manufacturers may implement loose lock-step systems where processors execute the same code, but not necessarily in a cycle-by-cycle fashion or at the same wall-clock time. In order to ensure that processors executing the same code do not get too far removed from one another, these systems count executed instructions and, after a predetermined number of instructions has been executed, synchronize by stalling the faster processor to allow the slower processor to catch up.
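The loose lock-step scheme described above may be sketched as a small simulation. This is an illustrative model only: the function `run_loose_lockstep`, the notion of a discrete time step, and the per-processor `speeds` (instructions executed per time step) are assumptions introduced here and do not come from any particular system.

```python
def run_loose_lockstep(total_instructions, sync_interval, speeds):
    """Simulate count-based synchronization (hypothetical model).

    speeds[i] is the number of instructions processor i executes per time
    step. A processor that reaches the next synchronization boundary stalls
    until every other processor has reached it as well.
    Returns the number of time steps each processor spent stalled.
    """
    if total_instructions < 1 or sync_interval < 1:
        raise ValueError("instruction counts must be positive")
    if not speeds or any(speed < 1 for speed in speeds):
        raise ValueError("every processor needs a positive speed")

    executed = [0] * len(speeds)
    stall_steps = [0] * len(speeds)
    boundary = min(sync_interval, total_instructions)

    while True:
        for i, speed in enumerate(speeds):
            if executed[i] >= boundary:
                # Faster processor waits at the boundary.
                stall_steps[i] += 1
            else:
                # Never run past the boundary before synchronizing.
                executed[i] = min(boundary, executed[i] + speed)
        if all(count == boundary for count in executed):
            if boundary == total_instructions:
                return stall_steps
            boundary = min(boundary + sync_interval, total_instructions)
```

For example, `run_loose_lockstep(20, 10, [5, 2])` models two processors synchronizing every 10 instructions; in this model only the faster processor accumulates stall time, while the slower one never waits.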
However, emerging technology in processor design allows non-deterministic processor execution. Non-deterministic processor execution may mean that multiple processors provided the same software application instructions will not necessarily execute the instructions in the same order, or using the same number of steps. The differences may be attributable to advances such as speculative execution (such as branch prediction), out-of-order processing, and soft error recovery implemented within the processor. Thus, two or more processors executing the same software application may not perform precisely the same sequence of instructions, and therefore strict lock-step fault tolerance, as well as loose lock-step fault tolerance that relies on counting retired instructions, may not be possible.
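A toy model may help show why a count-based synchronization point becomes unreliable when processors take different numbers of steps. The sketch below assumes a counter that sees every step a processor takes, including extra steps such as a replay after soft error recovery; the function `program_position_at_count` and the `extra_steps` mapping are hypothetical and are not drawn from any real processor.

```python
def program_position_at_count(counted_limit, extra_steps):
    """Return the program index reached when the step counter hits the limit.

    extra_steps maps a program index to the number of additional counted
    steps this processor spends on that instruction (a hypothetical stand-in
    for replay or mis-speculated work).
    """
    counted = 0
    position = 0
    while True:
        cost = 1 + extra_steps.get(position, 0)
        if counted + cost > counted_limit:
            return position
        counted += cost
        position += 1


# Two processors run the same program and stop at the same count of 10.
clean_position = program_position_at_count(10, {})
replayed_position = program_position_at_count(10, {3: 2})
```

In this model the processor with no extra steps stops at program index 10, while the processor that spent two extra steps at index 3 stops at index 8, so the two are no longer at the same point in the software application when the count expires.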