Processors, like other integrated circuits, are vulnerable to transient faults caused by strikes from alpha particles and cosmic radiation. These faults may lead to errors in the processor's operation, known as “soft” errors because they do not reflect a permanent malfunction of the device. Strikes by cosmic ray particles, such as neutrons, are particularly noteworthy because there is no practical way to shield a device against them. The rate at which processor soft errors occur is referred to as the soft error rate (SER). Some errors corrupt data without being detected; the rate at which these events occur is referred to as the silent data corruption (SDC) rate.
The failure rate of a circuit is related to both the size of the transistors and the circuit supply voltage. As transistors shrink with succeeding technology generations, they become individually less vulnerable to cosmic ray strikes. However, this size reduction is usually accompanied by a reduction in supply voltage, which increases susceptibility. Overall, decreasing voltage levels and exponentially increasing transistor counts cause chip susceptibility to increase rapidly. Additionally, error rates (measured in failures per unit time) are additive: the failure rate of a system is the sum of the failure rates of its components. Achieving a particular failure rate for a multiprocessor server therefore requires a correspondingly lower failure rate for each of its individual processors.
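The additive property of failure rates can be illustrated with a short sketch. The function names, the `other_rate` allowance for non-processor components, and the target value used in the example are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
"""Sketch: additive failure rates and per-processor budgets (illustrative only)."""


def system_failure_rate(component_rates):
    """Sum component failure rates; all rates must share one unit of time."""
    return sum(component_rates)


def per_processor_budget(system_target, n_processors, other_rate=0.0):
    """Largest per-processor rate that still meets the system target.

    other_rate is a hypothetical allowance for non-processor components.
    """
    if n_processors <= 0:
        raise ValueError("n_processors must be positive")
    remaining = system_target - other_rate
    if remaining < 0:
        raise ValueError("other components alone exceed the system target")
    return remaining / n_processors


if __name__ == "__main__":
    target = 1000.0  # hypothetical system target, in failures per unit time
    for n in (1, 4, 16):
        # The allowed rate per processor shrinks as processors are added.
        print(n, per_processor_budget(target, n))
```

Under these assumptions, a server with sixteen processors leaves each processor one sixteenth of the failure budget available to a single-processor system with the same target.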
Fault detection support may reduce a processor's SDC rate by halting computation before faults can propagate to permanent storage. Parity, for example, is a well-known fault detection mechanism that eliminates SDC for single-bit upsets in memory structures, because any single flipped bit changes the parity of the stored word and is therefore detected. Unfortunately, adding parity to latches or logic in a high-performance processor can adversely affect its cycle time and overall performance. Additionally, adding such codes to random logic is not straightforward, and current design tools do not support such an option.
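The detection property of parity can be sketched in software as follows. This is a minimal model assuming even parity over an integer word; the helper names are hypothetical, and a hardware implementation would compute the parity bit with XOR gates rather than by counting bits.

```python
"""Sketch: even parity over a stored word (software model, illustrative only)."""


def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def store(word: int):
    """Model a protected storage cell: keep the word with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word still matches the parity recorded at store time."""
    return parity_bit(word) == stored_parity


if __name__ == "__main__":
    word, p = store(0b10110010)

    single_upset = word ^ (1 << 3)   # one bit flipped by a hypothetical strike
    double_upset = word ^ 0b11       # two bits flipped

    print(check(word, p))          # unmodified word passes the check
    print(check(single_upset, p))  # single-bit upset fails the check
    print(check(double_upset, p))  # double-bit upset passes undetected
```

The sketch also shows the limit of the mechanism: an even number of flipped bits leaves the parity unchanged, so such an upset is not detected, and parity alone identifies an error without indicating which bit to correct.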