Transient errors, often referred to as soft errors, are an increasingly common source of failures in processors. As devices shrink and operate at lower voltages, they become more vulnerable to cosmic particle strikes and parameter variations. Such events can cause transient errors that occur randomly and can disrupt correct execution of a processor. Susceptibility to soft errors is expected to increase with each generation of semiconductor manufacturing technology.
Certain mechanisms have been used in attempts to correct soft errors. Typically, these measures provide redundant paths that perform the same operations on the same data. However, such redundant paths can significantly increase the size and power consumption of a processor, leading to performance degradation. Other approaches use simultaneous multithreading (SMT) to detect errors. In such approaches, a process is scheduled on two separate execution paths (e.g., two threads in an SMT core), and the resulting data are compared. If the results differ, a soft error is indicated and thereby detected. However, performance degradation is significant, because hardware that could run other processes is instead devoted to error detection, and because supporting result comparison and thread coordination adds complexity.
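By way of illustration only, the redundant-execution scheme described above can be modeled in software as follows. This is a minimal sketch and not a description of any particular processor: the names flip_bit and run_redundantly, and the fault_bit parameter used to inject an upset into the second copy, are hypothetical and introduced solely for this example.

```python
def flip_bit(value, bit):
    """Model a single-event upset by inverting one bit of an integer."""
    return value ^ (1 << bit)


def run_redundantly(operation, operand, fault_bit=None):
    """Run the same operation twice and compare the two results.

    fault_bit (hypothetical test hook): if given, that bit of the second
    copy's result is inverted to imitate a soft error in one path.
    Returns (result_of_first_copy, error_detected).
    """
    leading = operation(operand)    # first execution path
    trailing = operation(operand)   # redundant execution path
    if fault_bit is not None:
        trailing = flip_bit(trailing, fault_bit)
    # A mismatch only signals that an error occurred; with two copies
    # there is no way to tell which copy holds the correct value.
    return leading, leading != trailing


def triple(x):
    return 3 * x


clean_result, clean_error = run_redundantly(triple, 7)
faulty_result, faulty_error = run_redundantly(triple, 7, fault_bit=2)
```

In this model every operation is performed twice, which mirrors the cost noted above: the second execution consumes resources that could otherwise run other work, and a mismatch indicates only that an error occurred, not which copy is correct.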
While some processor designs have focused on protecting the datapath, caches, and main memories, register files (RFs) have been largely neglected. Because RFs are accessed very frequently, an error in an RF may be more likely to propagate to the output of a program. Adding parity to stored values may enable error detection, but correction is then possible only if the instruction that produced the corrupted value has not yet left the pipeline. Error correction coding (ECC), by contrast, may enable both detection and correction, but only at a high cost in area and power. Over-estimating the soft error rate can result in over-design of protection mechanisms, which increases the cost of achieving reliability. Conversely, insufficient protection against soft errors may leave a system unreliable.
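The difference between parity and ECC noted above can be illustrated with a small software model. The sketch below is an assumption-laden example, not a register file design: it uses a single even-parity bit for detection and a textbook Hamming(7,4) single-error-correcting code as a stand-in for ECC. All function names are hypothetical, and the 4-bit data width is chosen only to keep the example short.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based codeword positions holding check bits


def parity_bit(value):
    """Even-parity bit: 1 if the value has an odd number of set bits."""
    return bin(value).count("1") & 1


def parity_ok(value, stored_parity):
    """Detects any single-bit flip but cannot say which bit flipped."""
    return parity_bit(value) == stored_parity


def syndrome(codeword):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> (pos - 1)) & 1:
            s ^= pos
    return s


def hamming_encode(nibble):
    """Encode a 4-bit value into a 7-bit Hamming(7,4) codeword."""
    codeword = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            codeword |= 1 << (pos - 1)
    s = syndrome(codeword)
    for pos in PARITY_POSITIONS:
        if s & pos:                      # set check bits so the syndrome is 0
            codeword |= 1 << (pos - 1)
    return codeword


def hamming_decode(codeword):
    """Return (nibble, corrected); repairs at most one flipped bit."""
    s = syndrome(codeword)
    if s:                                # nonzero syndrome names the bad position
        codeword ^= 1 << (s - 1)
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (codeword >> (pos - 1)) & 1:
            nibble |= 1 << i
    return nibble, s != 0


value = 0b1011
stored_parity = parity_bit(value)
upset = value ^ 0b0100                   # one bit flipped in the stored value
parity_detects = not parity_ok(upset, stored_parity)

codeword = hamming_encode(value)
recovered, corrected = hamming_decode(codeword ^ (1 << 4))  # flip position 5
```

In this example parity adds one bit per stored value and reports only that an error exists, whereas the Hamming code adds three check bits per four data bits and recovers the original value. The ratio of check bits to data bits falls for wider words, so the figure here illustrates the trade-off rather than the overhead of any actual register file.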