Radiation-induced soft errors, caused primarily by neutron particles, have become a major problem for processor designers. Because this type of error does not reflect a permanent failure of the device, it is termed a soft or transient error. These bit upsets are in addition to those caused by alpha particles from packaging material and bumps. The exponential increase in the number of transistors on a single chip, together with aggressive voltage scaling, is expected to make this problem significantly worse in future generations of chips.
To address cosmic ray strikes, some approaches protect a large percentage of the latches in a processor or other semiconductor device with some form of error detection, such as parity. Similarly, most major arrays in high-performance microprocessors, such as caches and register files, have some form of error detection and recovery. As more transistors are added to a single chip, it becomes even more challenging to maintain the same level of reliability in succeeding generations of processors.
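As an illustration of the parity protection mentioned above, the following is a minimal software sketch, assuming a single even-parity bit per latch word. The function names (`parity_bit`, `protect`, `is_intact`) are hypothetical and model in software what would be a hardware check. The sketch also shows a known limitation of single-bit parity: it detects an odd number of flipped bits but misses an even number.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit of a non-negative integer word (1 if the count of 1 bits is odd)."""
    return bin(word).count("1") & 1


def protect(word: int) -> tuple[int, int]:
    """Model writing a latch: store the word together with its parity bit."""
    return word, parity_bit(word)


def is_intact(word: int, stored_parity: int) -> bool:
    """Model reading a latch: recompute parity and compare with the stored bit."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit latch contents, chosen only for illustration.
latch, stored = protect(0b10110010)

single_upset = latch ^ (1 << 3)          # one bit flipped by a particle strike
double_upset = single_upset ^ (1 << 5)   # a second bit flipped in the same word

single_detected = not is_intact(single_upset, stored)   # parity mismatch: detected
double_detected = not is_intact(double_upset, stored)   # parity matches again: missed
```

Parity alone only detects the upset. Recovery, where provided, requires a separate mechanism, such as refetching the data or using an error-correcting code.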
Reliability is measured in failures in time (FIT), where one FIT represents one failure in one billion hours of operation. There are three main components of FIT: the intrinsic error rate of the circuit, which is a function of the manufacturing process and clocking schemes; the number of bits in the microprocessor, which is a design parameter; and the architectural vulnerability factor (AVF), which is the probability that a bit flip results in a user-visible error. A user-visible error is defined as any bit corruption that reaches the pins of the microprocessor and escapes to main memory or an input/output (I/O) device. Of these three components, the AVF is the only one that can vary significantly over time. Indeed, studies have shown that the average AVF can vary greatly from one program to another, by over 90% in some cases. AVF can also vary significantly within a program when it is measured in real time over small periods known as quanta, rather than averaged over long runs.
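The relationship among the three components can be sketched numerically. The example below assumes the commonly used multiplicative model, in which the effective FIT is the product of the intrinsic per-bit error rate, the number of bits, and the AVF. The text above names the three components but does not state this formula, so it is an assumption here, and all numeric values are hypothetical.

```python
HOURS_PER_FIT_UNIT = 1e9  # one FIT = one failure per 10^9 hours of operation


def effective_fit(raw_fit_per_bit: float, num_bits: int, avf: float) -> float:
    """Effective FIT under the assumed model: raw per-bit FIT x bit count x AVF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return raw_fit_per_bit * num_bits * avf


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours implied by a FIT rate."""
    return float("inf") if fit == 0 else HOURS_PER_FIT_UNIT / fit


# Hypothetical design: fixed process (raw rate) and fixed bit count.
RAW_FIT_PER_BIT = 0.001
NUM_BITS = 1_000_000

# Hypothetical AVF values measured over three successive quanta of one program.
quantum_avf = [0.05, 0.40, 0.10]

# Only the AVF changes between quanta, so the effective FIT tracks it directly.
quantum_fit = [effective_fit(RAW_FIT_PER_BIT, NUM_BITS, a) for a in quantum_avf]
average_fit = sum(quantum_fit) / len(quantum_fit)

for avf, fit in zip(quantum_avf, quantum_fit):
    print(f"AVF={avf:.2f}  FIT={fit:.1f}  MTTF={mttf_hours(fit):.3g} hours")
```

With these made-up numbers, the per-quantum FIT differs by a factor of eight between the first and second quantum, a spread that a single long-run average would hide.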
Most architectural and microarchitectural error detection/recovery mechanisms attempt to reduce the average AVF of the microprocessor, thereby improving overall reliability. This improved reliability, however, comes at a cost in power and performance. Schemes such as parity prediction and residue checking, which are primarily used to protect execution units, can have a high power cost. Microarchitectural redundancy schemes such as redundant execution can have both a power cost and a performance cost, since execution units that could compute two different instructions in parallel are instead used to compute a single instruction redundantly. Most error mitigation schemes are always active, since there is currently no reliable way to measure the real-time AVF during program execution. As a result, the power and performance costs of such mechanisms are a fixed penalty.
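To make the residue scheme mentioned above concrete, the following is a minimal software sketch of residue checking on an adder, assuming a check modulus of 3. The names `checked_add` and `fault_mask` are hypothetical; `fault_mask` simply models a transient bit flip on the adder output. The extra residue computation performed on every operation illustrates where the always-on power cost comes from.

```python
MODULUS = 3  # assumed check modulus; a small modulus keeps the checker cheap


def residue(value: int) -> int:
    """Residue of a non-negative integer with respect to the check modulus."""
    return value % MODULUS


def checked_add(a: int, b: int, fault_mask: int = 0) -> tuple[int, bool]:
    """Add two operands and verify the sum against an independently predicted residue.

    fault_mask is XORed into the sum to model a transient fault in the adder.
    Returns (result, ok), where ok is False when the residue check fails.
    """
    result = (a + b) ^ fault_mask
    predicted = (residue(a) + residue(b)) % MODULUS  # computed from the operands only
    return result, residue(result) == predicted


# Fault-free addition passes the check.
clean_sum, clean_ok = checked_add(1234, 5678)

# A single flipped output bit changes the sum by a power of two, which is never
# a multiple of 3, so the mod-3 residue check flags it.
faulty_sum, faulty_ok = checked_add(1234, 5678, fault_mask=1 << 4)
```

A redundant-execution scheme would instead compute the sum twice and compare the two results, which detects the same fault but occupies a second execution unit for the duration of the operation.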