The present invention relates to highly reliable processor implementations and architectures, and in particular, to processor implementations and architectures that rely on an operating system (OS) for error recovery.
All semiconductor integrated circuits, including microprocessors, are subject to soft errors, which are caused by alpha particle bombardment and gamma ray radiation. If left undetected, these soft errors can corrupt data, leading to undefined behavior in computer systems. To combat these problems, many microprocessors today use parity or Error Correcting Code (ECC) check bits to protect their critical on-chip memory structures. Parity protection can only detect soft errors, whereas ECC can both detect and correct them. However, the correction hardware is often expensive, both in the silicon area it consumes and in its timing impact on the final operating frequency of the processor. For this reason, the extra correction hardware is often not implemented. Alternatively, many hardware implementations have used a hybrid scheme in which more performance-sensitive errors are corrected fully in hardware while less performance-sensitive ones are handled in software. Consequently, with both parity and ECC protection schemes, there is a need for an efficient software error correction scheme.
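To make the difference between detection and correction concrete, the following is a minimal Python sketch. It uses a single even-parity bit over a word (detection only) and a textbook Hamming(7,4) single-error-correcting code as a stand-in for ECC. The function names are hypothetical, and real on-chip structures typically use wider codes with additional check bits, so this illustrates the principle rather than any particular processor's implementation.

```python
def parity(word: int) -> int:
    """Even-parity check bit: 1 if word has an odd number of set bits."""
    return bin(word).count("1") & 1


def hamming74_encode(data: int) -> int:
    """Encode 4 data bits; bit i of the result holds codeword position i (1..7)."""
    d1, d2, d3, d4 = ((data >> i) & 1 for i in range(4))
    p1 = d1 ^ d2 ^ d4  # checks positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 5, 6, 7
    bits = {1: p1, 2: p2, 3: d1, 4: p4, 5: d2, 6: d3, 7: d4}
    return sum(bit << pos for pos, bit in bits.items())


def hamming74_syndrome(codeword: int) -> int:
    """XOR of the positions of all set bits: 0 if clean, else the flipped position."""
    s = 0
    for pos in range(1, 8):
        if (codeword >> pos) & 1:
            s ^= pos
    return s


def hamming74_decode(codeword: int) -> tuple[int, bool]:
    """Return (data, corrected) after repairing at most one flipped bit."""
    s = hamming74_syndrome(codeword)
    if s:
        codeword ^= 1 << s  # flip back the bit at the position the syndrome names
    # Data bits live at codeword positions 3, 5, 6, 7.
    data = sum(((codeword >> pos) & 1) << i for i, pos in enumerate((3, 5, 6, 7)))
    return data, bool(s)


word = 0b1011_0010
check = parity(word)
corrupted = word ^ (1 << 4)  # simulated single-bit soft error
print(parity(corrupted) != check)  # True: parity detects the error but cannot locate it

cw = hamming74_encode(0b1101)
data, corrected = hamming74_decode(cw ^ (1 << 6))
print(bin(data), corrected)  # 0b1101 True: the syndrome locates and repairs the flip
```

A parity mismatch indicates only that some bit flipped; the Hamming syndrome names the flipped position, which is what allows the error to be repaired. With two flipped bits, Hamming(7,4) would miscorrect, which is one reason practical designs add a further check bit for double-error detection.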
In a typical software error correction scheme, whenever the hardware detects a soft error, execution control is transferred to an error handler. The error handler can then terminate the offending process (or processes) to contain the error and minimize its impact. Once the error has been handled, the terminated process (or processes) can be restarted. Because only the offending processes are affected, the rest of the system remains intact.
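The containment flow described above can be modeled in a few lines. This is a toy Python sketch under stated assumptions: `System`, `Process`, and `error_handler` are hypothetical names, calling `error_handler` stands in for the hardware transferring control on a detected soft error, and the set of offending process IDs is assumed to be known. It does not model real hardware or any particular operating system.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    state: str = "running"
    restarts: int = 0


@dataclass
class System:
    processes: dict[int, Process] = field(default_factory=dict)
    events: list[str] = field(default_factory=list)

    def spawn(self, pid: int) -> None:
        self.processes[pid] = Process(pid)

    def error_handler(self, offending: set[int]) -> None:
        """Hypothetical handler entered when hardware reports a soft error."""
        # Contain the error: terminate only the processes whose state may be corrupt.
        for pid in sorted(offending):
            self.processes[pid].state = "terminated"
            self.events.append(f"terminate {pid}")
        # Recover: restart each terminated process; all other processes are untouched.
        for pid in sorted(offending):
            proc = self.processes[pid]
            proc.state = "running"
            proc.restarts += 1
            self.events.append(f"restart {pid}")


system = System()
for pid in (1, 2, 3):
    system.spawn(pid)
system.error_handler({2})  # soft error attributed to process 2
print(system.events)  # ['terminate 2', 'restart 2']
```

Processes 1 and 3 are never terminated, which mirrors the point that only the offending processes are affected while the rest of the system keeps running.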