Energetic subatomic particles such as neutrons from cosmic rays and alpha particles from radioactive trace components in the packaging of a semiconductor device may generate electron-hole pairs as they pass through such a device. Transistor source and drain terminals in the device may collect these charges and eventually a sufficient accumulation of charge may cause a logic device incorporating such a transistor to invert state or flip, introducing a logic fault into the circuit's operation. These faults are transient, because they are not a permanent failure of the device, and are therefore termed soft or transient errors. A common form of soft error is an error in a transistor that forms part of a memory cell such as a cache cell or register cell, causing a bit represented by such a cell to be flipped from its intended value.
The likelihood of a soft error affecting a processor or other semiconductor device depends on the number of on-chip transistors. In processors in particular, the number of on-chip transistors has grown very rapidly, and the soft error rate has grown in proportion. Reducing the impact of soft errors on processor operation has therefore become increasingly important.
FIG. 1 illustrates a classification of soft errors in a processor memory unit such as a register or cache, depicted as a flowchart for clarity. When a soft error occurs, 110, the fault may be considered benign if the affected bit has not been read, 120 and 140. If the bit was read, but the affected unit, such as a cache line or register bank, has error protection built in, 130, the error may be recoverable or, at least, detectable. Such error protection is well known and includes, for example, parity and ECC schemes. In the situation where a bit does not have error protection, and the bit affects the correctness of any computation underway in the processor, a silent data corruption 180 is said to have occurred. This is an undesirable outcome whose likelihood processor designers attempt to minimize.
If the error is detected and can be corrected, 150, then the bit is set or reset to its original value and processor operation continues normally, 190. If the error cannot be corrected, but has been detected, the processor may take additional action because such an error is considered unrecoverable, 170. This type of error is termed a detected unrecoverable error or DUE.
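The classification of FIG. 1 can be restated as a small decision function. The following Python sketch is illustrative only: the names `Outcome` and `classify_soft_error` and the four boolean inputs are hypothetical and do not appear in the figure. It also assumes that an unprotected bit which is read but does not affect the correctness of any computation is treated as benign, a case the description above does not spell out.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the end states of the FIG. 1 classification."""
    BENIGN = "benign fault"
    CORRECTED = "corrected error, normal operation continues"
    DUE = "detected unrecoverable error"
    SDC = "silent data corruption"


def classify_soft_error(bit_read: bool, protected: bool,
                        correctable: bool, affects_outcome: bool) -> Outcome:
    """Classify one flipped bit by walking the decision points in order."""
    if not bit_read:
        # The flipped bit is never consumed, so the fault is benign.
        return Outcome.BENIGN
    if protected:
        # Error protection (e.g. parity or ECC) at least detects the error.
        return Outcome.CORRECTED if correctable else Outcome.DUE
    # Unprotected bit that was read. Assumption: if it does not change any
    # computation, it is treated as benign rather than as corruption.
    return Outcome.SDC if affects_outcome else Outcome.BENIGN
```

For example, a read bit in a unit that can detect but not correct the error, such as one protected only by parity, yields `classify_soft_error(True, True, False, True)`, which returns `Outcome.DUE`.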
Generally, a DUE results in an error-caused termination of at least the executing process that attempted to read the erroneous bit, and sometimes of the entire operating system running on the processor, causing a machine halt or restart. It is, of course, preferable to terminate one process rather than the entire system so as to minimize the overall impact of the DUE. In highly reliable systems such as critical-use servers, designers attempt to ensure that the mean time between system-terminating DUEs is very high, e.g., 25 years.
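To put the 25-year figure in perspective, it can be converted into a failure rate. The Python sketch below uses the FIT unit (failures per 10^9 device-hours), which is a common reliability unit but is not used in the text above. It also assumes a constant failure rate, so that the rate is simply the reciprocal of the mean time between failures. The function name `mtbf_years_to_fit` is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; leap years are ignored


def mtbf_years_to_fit(mtbf_years: float) -> float:
    """Convert a mean time between failures in years to a FIT rate.

    One FIT is one failure per 10**9 device-hours. A constant failure
    rate is assumed, so rate = 1 / MTBF.
    """
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


if __name__ == "__main__":
    # A 25-year mean time between system-terminating DUEs.
    print(round(mtbf_years_to_fit(25), 1))  # prints 4566.2
```

Under these assumptions, a 25-year target corresponds to a budget of roughly 4566 FIT for system-terminating DUEs across the whole device.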
When a DUE is detected, the processor generally enters a software error handling routine. Using register error logs, the routine determines whether the DUE warrants a process or a system termination and how to proceed. In one scenario, a second DUE may occur during the execution of the error handling routine for a first DUE. While such an occurrence is relatively unlikely, a designer of a high reliability processor may need to consider this scenario.
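The nested scenario can be made concrete with a small model of such a handling routine. The Python sketch below is purely illustrative: the `ErrorLog` record, its `in_kernel_state` field, and the `DueHandler` class are hypothetical, and the policy of escalating to a system termination when a second DUE arrives during handling is one conservative assumption, not a behavior stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorLog:
    """Hypothetical snapshot of the register error logs for one DUE."""
    in_kernel_state: bool  # assumed field: the error hit operating-system state


class DueHandler:
    """Toy model of a software DUE handling routine with a re-entrancy guard."""

    def __init__(self) -> None:
        self.handling = False  # True while a first DUE is being handled

    def on_due(self, log: ErrorLog,
               nested_fault: Optional[ErrorLog] = None) -> str:
        if self.handling:
            # Assumed policy: a second DUE during handling escalates to a
            # system termination, since the handler state may be unreliable.
            return "terminate_system"
        self.handling = True
        try:
            if nested_fault is not None:
                # Simulate a second DUE arriving while the first is handled.
                return self.on_due(nested_fault)
            # Assumed policy: errors in operating-system state terminate the
            # system; all other errors terminate only the affected process.
            if log.in_kernel_state:
                return "terminate_system"
            return "terminate_process"
        finally:
            self.handling = False
```

In this model, `DueHandler().on_due(ErrorLog(in_kernel_state=False))` returns `"terminate_process"`, while the same call with `nested_fault=ErrorLog(in_kernel_state=False)` returns `"terminate_system"`.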