Embodiments of the present invention relate generally to handling errors in a processor, and more specifically to handling soft errors in a merge buffer of a microprocessor.
Single-bit upsets, or errors from transient faults, have emerged as a key challenge in microprocessor design. These faults arise from energetic particles (such as neutrons from cosmic rays and alpha particles from packaging material) generating electron-hole pairs as they pass through a semiconductor device. Transistor source and diffusion nodes can collect these charges. A sufficient amount of accumulated charge may change the state of a logic device (such as a static random access memory (SRAM) cell, a latch, or a gate), thereby introducing a logical error into the operation of an electronic circuit. Because this type of error does not reflect a permanent failure of the device, it is termed a soft or transient error.
Soft errors are becoming an increasing burden for microprocessor designers as the number of on-chip transistors continues to grow. The raw error rate per latch or SRAM bit may be projected to remain roughly constant or decrease slightly for the next several technology generations. Thus, unless error protection mechanisms are added or more robust technology (such as fully-depleted silicon-on-insulator) is used, a microprocessor's soft error rate may grow in proportion to the number of devices added to the processor in each succeeding generation.
Bit errors may be classified based on their impact and the ability to detect and correct them. Some bit errors may be classified as “benign errors” because they are not read, do not matter, or can be corrected before they are used. The most insidious form of error is silent data corruption, where an error is not detected and induces the system to generate erroneous outputs. To avoid silent data corruption, designers may employ error detection mechanisms, such as parity. Error correction techniques may be employed to fix detected errors, although such techniques may not be applicable in all situations. The ability to detect an error but not correct it may avoid generating incorrect outputs (by shutting down the affected processes before incorrect outputs are generated), but it may not provide a mechanism to recover and continue executing the affected processes when such an error occurs. Errors in this category may be called detected unrecoverable errors (DUE).
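By way of illustration only, the detect-but-not-correct behavior of parity may be sketched as follows. The sketch assumes a single even-parity bit stored alongside each data word; the function names (parity, store, check, flip_bit) and the example word are hypothetical and are not drawn from any particular embodiment.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def store(word: int) -> tuple[int, int]:
    """Model a protected storage cell: the data word plus its parity bit."""
    return (word, parity(word))


def check(stored: tuple[int, int]) -> bool:
    """Return True if the stored word still matches its parity bit."""
    word, parity_bit = stored
    return parity(word) == parity_bit


def flip_bit(stored: tuple[int, int], bit: int) -> tuple[int, int]:
    """Model a particle strike that inverts one data bit (parity bit untouched)."""
    word, parity_bit = stored
    return (word ^ (1 << bit), parity_bit)


# Hypothetical 8-bit data word.
stored = store(0b1011_0010)

# A single-bit upset changes the parity of the word, so the check fails.
upset = flip_bit(stored, 3)
single_detected = not check(upset)

# A second upset in the same word restores the parity, so the check passes
# even though the data is wrong (an instance of silent data corruption).
double_upset = flip_bit(upset, 5)
double_detected = not check(double_upset)
```

In this sketch a failed check indicates only that some bit changed, not which one. The word therefore cannot be repaired from the parity bit alone, which corresponds to a detected unrecoverable error unless a clean copy of the data exists elsewhere.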
DUE events may be further subdivided according to whether the DUE event results in the operating system and/or another mechanism killing one or more user processes that were impacted by the error, or whether the DUE event results in crashing the entire machine, including all of its processes, to prevent data corruption. The first type may be called a “process-kill DUE” event. The second type may be called a “system-kill DUE” event. A process-kill DUE is preferable to a system-kill DUE because a process-kill DUE allows the system to continue running and servicing the processes not affected by the error. For example, large-scale computer systems may execute hundreds of processes at a time. Therefore, isolating a transient error to one process (or a small set of processes) and killing just that process (or small set of processes) would provide a substantial advantage over crashing the entire system and killing all of the processes then being executed.
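The difference in impact between the two DUE types may be sketched, purely for illustration, as follows. The function surviving_processes, the policy strings, and the process names are hypothetical; the sketch assumes that the set of processes impacted by the error is already known and does not model how that set would be determined.

```python
def surviving_processes(running: set[str], affected: set[str], policy: str) -> set[str]:
    """Return the processes still running after a DUE event is handled.

    policy is "process-kill" (only impacted processes are terminated) or
    "system-kill" (the entire machine is crashed). Both labels are
    illustrative names for the two DUE types described above.
    """
    if policy == "system-kill":
        return set()
    if policy == "process-kill":
        return running - affected
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical system executing 200 processes, one of which consumed bad data.
running = {f"p{i}" for i in range(200)}
affected = {"p7"}

after_process_kill = surviving_processes(running, affected, "process-kill")
after_system_kill = surviving_processes(running, affected, "system-kill")
```

Under these assumptions the process-kill case leaves 199 of the 200 processes running, while the system-kill case leaves none.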
Thus, a need exists for converting merge buffer system-kill errors to process-kill errors.