Memory and logic elements in a microprocessor or processor are susceptible to soft errors, which can be induced by background cosmic radiation and alpha particle bombardment. A soft error is an unexpected or unwanted change in the state of the computer system. For example, one bit in a storage element may suddenly, randomly change state from a “0” to a “1” or vice versa. Another example of a soft error is a glitch of noise inside the computer system which may get stored as if the noise were valid data. In either of these two cases, one bit becomes something other than what it is supposed to be, possibly changing an instruction in a program or a data value.
Processors frequently employ parity-based mechanisms to detect data corruption due to soft errors. A parity bit is associated with each block of data when the data is stored. The parity bit is set to either one or zero according to whether there is an odd or an even number of ones in the data block. When the data is read out of its storage location, the parity of the number of ones in the block is recomputed and compared with the stored parity bit. A discrepancy between the values indicates that the data block has been corrupted. An agreement between the values indicates that either no corruption has occurred or an even number of bits (two or more) have been altered. Since the latter event has a low probability of occurring, the parity-based mechanism provides a reliable indication as to whether data has been corrupted. An error handling mechanism is employed to either correct the detected error or minimize its impact. Soft errors may be corrected via hardware, software, or both.
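The parity check described above can be illustrated with a minimal sketch. The function names and the 8-bit example block are assumptions made for illustration only; an actual processor performs this check in hardware.

```python
def parity_bit(block: int) -> int:
    """Return 1 if the block holds an odd number of ones, otherwise 0."""
    return bin(block).count("1") % 2


def is_corrupted(block: int, stored_parity: int) -> bool:
    """Recompute the parity on read-out and compare it with the stored bit."""
    return parity_bit(block) != stored_parity


# Hypothetical 8-bit data block containing four ones.
data = 0b10110010
stored = parity_bit(data)

# A single flipped bit changes the parity and is detected.
single_flip = data ^ 0b00000100

# Two flipped bits restore the original parity and go undetected.
double_flip = data ^ 0b00010100

print(is_corrupted(data, stored))         # uncorrupted block
print(is_corrupted(single_flip, stored))  # one bit altered
print(is_corrupted(double_flip, stored))  # two bits altered
```

The last case shows the limitation noted above: when an even number of bits change, the recomputed parity still agrees with the stored bit.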
A commonly used hardware error correction scheme is error correction codes (ECCs), a parity-based mechanism that tracks additional information for each data block. The additional information allows the corrupted bit(s) to be identified and corrected. The entire error correction process is transparent to the software that is running at the time the error occurs. While effective, a pure hardware ECC-based error correction scheme is complex and inefficient to implement because of the amount of silicon area that it consumes.
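As an illustration of how additional parity information allows a corrupted bit to be located and corrected, the following sketch uses a Hamming(7,4) code. The choice of Hamming(7,4) is an assumption made for the example; the text above does not state which ECC a given processor uses, and the function names are hypothetical.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword: List[int]) -> Tuple[List[int], int]:
    """Correct at most one flipped bit; return (data bits, error position).

    An error position of 0 means no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


original = [1, 0, 1, 1]
stored_word = hamming74_encode(original)

# Simulate a soft error that flips the bit at position 5.
corrupted_word = list(stored_word)
corrupted_word[4] ^= 1

recovered, position = hamming74_correct(corrupted_word)
print(recovered, position)
```

Three parity bits protect four data bits here, which gives a sense of the storage overhead that makes a pure hardware ECC scheme costly in silicon area.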
Because of this, current processors utilize a prioritization scheme. A first type of error is detected and corrected entirely in the processor hardware. A second type is detected in the processor and corrected in firmware. A third type of error can be detected but not corrected, even with firmware. Finally, a fourth type of error, once detected by the processor, requires the processor to be rebooted. Each type of error is signaled to the processor differently to allow the processor to behave differently. This is referred to as the signaling mechanism.
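The four error types and their distinct signaling can be sketched as a simple lookup. All identifiers and signal names below are hypothetical labels chosen for illustration; the text above specifies only that each type is signaled differently, not what the signals are called or how a type 3 error is ultimately handled.

```python
from enum import Enum


class ErrorType(Enum):
    HW_CORRECTED = 1   # detected and corrected in processor hardware
    FW_CORRECTED = 2   # detected in the processor, corrected in firmware
    UNCORRECTED = 3    # detected, but not correctable even with firmware
    FATAL = 4          # requires the processor to be rebooted


# Assumed signal names: one distinct signal per error type.
SIGNAL_FOR_TYPE = {
    ErrorType.HW_CORRECTED: "hardware_corrected_notice",
    ErrorType.FW_CORRECTED: "firmware_correction_request",
    ErrorType.UNCORRECTED: "uncorrected_error_report",
    ErrorType.FATAL: "reboot_request",
}


def signal_error(error_type: ErrorType) -> str:
    """Return the (hypothetical) signal raised for a given error type."""
    return SIGNAL_FOR_TYPE[error_type]


for error_type in ErrorType:
    print(error_type.name, "->", signal_error(error_type))
```

Because the mapping is fixed, the processor's response to each error type is the same regardless of the operating system or platform in use.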
This prioritization scheme is not necessarily advantageous to the other parts of the computer system, namely its operating system and its platform (the hardware and firmware portion of the computer system other than the processors). For example, in platforms used in mission critical computing, the treatment that the signaling mechanism gives a detected type 4 error, namely a reboot, would be catastrophic. This is because these platforms require some level of system availability and error information collection rather than a computer system reboot.
On the other hand, in non-mission critical computing at the low end of the computer market (e.g., personal computers), it is acceptable for a user's computer system to reboot automatically when it encounters soft errors. However, these approaches remain independent of one another, and each addresses only a subset of the problems associated with the different prioritization schemes for a computer system's processor, operating system and platform.
Accordingly, what is needed is an effective and efficient error handling mechanism that controls the processor to promote or demote the error type in a manner that is compatible with the computer system's operating system and platform.