Modern processors are vulnerable to transient faults caused by strikes from alpha particles and cosmic radiation. These faults may lead to errors in the processor's operation, known as “soft” errors since they do not reflect a permanent malfunction of the device. Strikes by cosmic ray particles, such as neutrons, are particularly noteworthy because there is no practical way to protect against them. The rate at which processor soft errors occur is referred to as the soft error rate (SER). Some errors corrupt data without being detected; the rate at which these events occur is referred to as the silent data corruption (SDC) rate.
The failure rate of a circuit is related to both the size of the transistors and the circuit supply voltage. As transistors shrink in size with successive technology generations, they become individually less vulnerable to cosmic ray strikes. However, this size reduction is usually accompanied by a reduction in supply voltage, which increases susceptibility. Overall, decreasing voltage levels and exponentially increasing transistor counts cause chip susceptibility to increase rapidly. Additionally, error rates (measured in failures per unit time) are additive, which means that achieving a particular failure rate for a multiprocessor server requires a correspondingly lower failure rate for each of its individual processors. Possible solutions to such increasing error rates include making processor circuits less susceptible to errors, but such circuit techniques cannot eliminate the problem entirely, and they add cost and complexity.
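The additive property can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that the processors are the only contributors to the server failure rate and that all processors share the budget equally; the function names and the numbers are hypothetical and chosen only for illustration.

```python
def system_failure_rate(processor_rates):
    """Failure rates (failures per unit time) add across components."""
    return sum(processor_rates)


def per_processor_budget(system_target, num_processors):
    """Largest per-processor failure rate that keeps the server on target.

    Assumption: the budget is split equally and nothing else in the
    server contributes failures, which is a simplification.
    """
    if num_processors <= 0:
        raise ValueError("num_processors must be positive")
    return system_target / num_processors


# A 4-processor server with a target of 1000 failures per unit time
# leaves each processor a budget of 250 in the same units.
budget = per_processor_budget(1000.0, 4)
total = system_failure_rate([budget] * 4)
```

Doubling the processor count under the same server target halves the budget available to each processor.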
Similarly, fault detection support can reduce a processor's SDC rate by halting computation before faults can propagate to permanent storage. Parity, for example, is a well-known fault detection mechanism that eliminates SDC for single-bit upsets in memory structures. Unfortunately, adding parity to latches or logic in a high-performance processor can adversely affect its cycle time and overall performance. Additionally, adding such codes to random logic is not straightforward, and current design tools do not support such an option.
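The way parity catches a single-bit upset can be sketched in software. This is an illustrative model only, assuming even parity over a stored word; the helper names are hypothetical, and real hardware computes parity with XOR trees rather than by counting bits.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word):
    """Model a memory write: keep the word together with its parity bit."""
    return word, parity_bit(word)


def check(word, stored_parity):
    """Model a memory read: True if the word still matches its parity."""
    return parity_bit(word) == stored_parity


def flip_bit(word, position):
    """Simulate a particle strike that inverts one bit of the word."""
    return word ^ (1 << position)


word, parity = store(0b10110010)
single_upset = flip_bit(word, 3)
double_upset = flip_bit(single_upset, 5)

clean_ok = check(word, parity)            # no fault: check passes
single_ok = check(single_upset, parity)   # one flipped bit: check fails
double_ok = check(double_upset, parity)   # two flipped bits: check passes
```

A single flipped bit changes the parity and is detected, whereas two flipped bits cancel out and go unnoticed, which is why the guarantee is limited to single-bit upsets.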
Consequently, designers have resorted to redundant execution mechanisms to detect such faults in a processor. One such mechanism is lockstepping, in which multiple cores are allocated for each program, consuming resources that could otherwise be used to boost performance, particularly in a multithreaded environment. By its very nature, lockstepping requires the lockstepped processor cores to perform the same operation in the same cycle. For example, both processors must suffer a cache miss latency or branch misprediction in lockstep, so that a checker, which checks the results generated by the lockstepped cores, does not see an output mismatch.
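The timing constraint on lockstepped cores can be illustrated with a toy model. The sketch below assumes a hypothetical core that emits one result per cycle and a checker that compares the two output traces cycle by cycle; the names STALL, run_core, and first_mismatch are invented for this example and do not describe any particular processor.

```python
STALL = None  # hypothetical marker: the core produced no output this cycle


def run_core(values, stall_cycles=()):
    """Return the per-cycle output trace of a toy core.

    The core emits value + 1 for each input, one per cycle, and emits
    STALL on every cycle listed in stall_cycles (for example, while
    waiting on a cache miss).
    """
    pending = list(values)
    trace = []
    cycle = 0
    while pending:
        if cycle in stall_cycles:
            trace.append(STALL)
        else:
            trace.append(pending.pop(0) + 1)
        cycle += 1
    return trace


def first_mismatch(trace_a, trace_b):
    """Cycle-by-cycle checker: first mismatching cycle, or None if equal."""
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return cycle
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None


inputs = [10, 20, 30]
# Only one core stalls: the checker flags a mismatch although no fault occurred.
skewed = first_mismatch(run_core(inputs), run_core(inputs, stall_cycles={1}))
# Both cores stall on the same cycle: the traces stay identical.
aligned = first_mismatch(run_core(inputs, {1}), run_core(inputs, {1}))
```

In this model the two cores compute identical results in both cases; the mismatch in the first case comes purely from the difference in timing.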
To make more efficient use of processor resources, another technique called Redundant Multithreading (RMT) has been proposed. RMT detects faults by running two copies of the same program as separate threads in a single core, feeding them identical inputs, and comparing their outputs. However, a basic RMT implementation still suffers from complexity and efficiency issues.
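The idea of running two copies on identical inputs and comparing their outputs can be sketched as a software analogy. The example below is not a hardware RMT implementation: it uses Python threads merely to stand in for the two hardware threads, and the checksum program, the flip_at fault-injection hook, and the run_redundant helper are all hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(data, flip_at=None):
    """Toy program: a running sum over data, kept to 32 bits.

    flip_at is a hypothetical fault-injection hook: if set, one bit of
    the accumulator is inverted right after that index is processed.
    """
    acc = 0
    for i, x in enumerate(data):
        acc = (acc + x) & 0xFFFFFFFF
        if i == flip_at:
            acc ^= 1 << 7  # simulated transient fault
    return acc


def run_redundant(program, inputs, fault_in_copy_b=None):
    """Run two copies of program on identical inputs and compare outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        copy_a = pool.submit(program, inputs)
        copy_b = pool.submit(program, inputs, fault_in_copy_b)
        out_a, out_b = copy_a.result(), copy_b.result()
    if out_a != out_b:
        # Halt before the corrupted value can propagate any further.
        raise RuntimeError("output mismatch: fault detected")
    return out_a


result = run_redundant(checksum, [1, 2, 3, 4])
```

Unlike the cycle-by-cycle comparison used in lockstepping, only the final outputs are compared here, so the two copies are free to progress at different speeds.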