Embodiments of the present invention relate generally to handling errors in a processor.
Single bit upsets or errors from transient faults have emerged as a key challenge in microprocessor design. These faults arise from energetic particles—such as neutrons from cosmic rays and alpha particles from packaging material—generating electron-hole pairs as they pass through a semiconductor device. Transistor source and diffusion nodes can collect these charges. A sufficient amount of accumulated charge may change the state of a logic device—such as a static random access memory (SRAM) cell, a latch, or a gate—thereby introducing a logical error into the operation of an electronic circuit. Because this type of error does not reflect a permanent failure of the device, it is termed a soft or transient error.
Soft errors are becoming an increasing burden for microprocessor designers as the number of on-chip transistors continues to grow. The raw error rate per latch or SRAM bit may be projected to remain roughly constant or decrease slightly for the next several technology generations. Thus, unless error protection mechanisms are added or more robust technology (such as fully-depleted silicon-on-insulator) is used, a microprocessor's soft error rate may grow in proportion to the number of devices added to the chip in each succeeding generation.
Bit errors may be classified based on their impact and the ability to detect and correct them. Some bit errors may be classified as “benign errors” because they are not read, do not matter, or can be corrected before they are used. The most insidious form of error is silent data corruption, where an error is not detected and induces the system to generate erroneous outputs. To avoid silent data corruption, designers often employ error detection mechanisms, such as parity. Error correction techniques may also be employed to fix detected errors, although such techniques cannot be applied in all situations. The ability to detect an error but not correct it may avoid generating incorrect outputs (by shutting down the affected processes before incorrect outputs are generated), but it may not provide a mechanism to recover and continue executing the affected processes when such an error occurs. Errors in this category may be called detected unrecoverable errors (DUE, or DUE errors, or DUE events).
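The distinction between detecting and correcting an error may be illustrated with a minimal sketch of even parity over a single word. The function names, the 8-bit sample value, and the choice of flipped bits are assumptions made only for this illustration and do not describe any particular processor. A single flipped bit changes the parity and is therefore detected, but the parity bit alone does not identify which bit flipped, so the word cannot be repaired. A second flip restores the parity and would go unnoticed, which corresponds to silent data corruption.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit (0 or 1) for a non-negative integer word."""
    return bin(word).count("1") % 2


def matches_parity(word: int, stored_parity: int) -> bool:
    """Return True if the word still agrees with its stored parity bit."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit value and the parity bit stored alongside it.
original = 0b1011_0010
stored = parity_bit(original)

# Simulate a single-bit upset by flipping bit 3.
corrupted = original ^ (1 << 3)
single_flip_detected = not matches_parity(corrupted, stored)

# A second flip (bit 5) restores the parity, so the check passes
# even though the data is still wrong.
double_flip = corrupted ^ (1 << 5)
double_flip_detected = not matches_parity(double_flip, stored)
```

In this sketch the single-bit upset is detected but not correctable, which is the situation that gives rise to a DUE.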
DUE errors may be further subdivided according to whether the DUE error results in the operating system and/or another mechanism killing one or more user processes that were impacted by the error or whether the DUE error results in crashing the entire machine, including all of its processes, to prevent data corruption. The first type may be called a “process-kill DUE” error. The second type may be called a “system-kill DUE” error. A process-kill DUE is preferable to a system-kill DUE because a process-kill DUE allows the system to continue running and servicing the processes not affected by the error.
To address soft errors introduced by transient faults, microprocessor designers may include a variety of error protection features. Examples of protection features that may be used are parity, error correcting code (ECC), cyclic redundancy checking (CRC), lockstepping, radiation-hardened cells, and silicon-on-insulator manufacturing technology.
Error protection features may also be included in software. Some software programs may involve extremely complex computations that may run for weeks or months on even the fastest available computers. To reduce the impact of hardware errors (that may crash programs or entire systems), some programs may implement error recovery techniques, such as application-level checkpointing, to avoid losing all their intermediate computations if the program or system crashes before the final computations are completed. Checkpointing may be added to an application program or process so that the program periodically saves its own state. Then if an error, such as a process-kill DUE, results in the application program or process being killed, halted, or shut down, recovery may be made by restarting execution of the application program or process from the checkpoint.
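Application-level checkpointing of the kind described above may be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular program: the function names, the JSON file format, the checkpoint interval, and the `fail_at` parameter (which simulates the process being killed) are all hypothetical. The program periodically saves its own state, and a restarted run resumes from the last saved state rather than from the beginning.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over path.

    The rename means a kill during the write leaves the previous
    checkpoint intact instead of a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(path, steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..steps-1, saving state every few steps.

    fail_at is a hypothetical hook that simulates a process kill.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated process kill")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a first call such as `run(path, 100, fail_at=25)` is interrupted, a second call `run(path, 100)` resumes from the most recent checkpoint, so only the work performed since that checkpoint is repeated.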
Upon encountering a process-kill DUE error, conventional computer systems inform the operating system, which may have no option but to kill the program(s) affected by the error. Unfortunately, conventional computer systems do not provide a way for a hardware error, such as a process-kill DUE, to be vectored back to an application-level process to allow the application program to trigger or handle its own recovery. Thus, when an application program crashes, valuable computing time may be lost waiting for a user to intervene and restart the program.
A need thus exists to vector process-kill errors to an application program.