Data stored in transistor-based integrated circuit devices is subject to spontaneous errors resulting from physical changes in the integrated circuit. These errors can be permanent or non-permanent. In the case of permanent errors, the transistors storing individual data bits have permanently failed. Non-permanent errors, on the other hand, are often a result of radiation, such as cosmic radiation or radiation from the decay of radioactive material. These non-permanent, or soft, errors occur at random locations and times. Soft errors occur with greater frequency in systems that have larger and denser memory arrangements. Because the industry trend is to include more memory capacity in a smaller area, soft errors are becoming an increasingly significant problem.
One method for detecting and correcting soft errors in stored data involves storing one or more error correction code (ECC) bits along with the data. An algorithm is used to generate an ECC word associated with a predetermined number of data bits to be stored. When the stored data bits are retrieved from a memory device along with their associated ECC word, the ECC word is "decoded", or checked, allowing the detection of single-bit or multiple-bit errors. Errors detected this way are sometimes correctable. For example, an error involving a single bit is usually correctable, but errors involving more than one bit usually are not.
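The ECC scheme described above can be illustrated with a minimal sketch. The text does not name a particular code, so the example assumes an extended Hamming code (three check bits plus one overall parity bit protecting four data bits), which is one common way to correct single-bit errors and detect double-bit errors. The function names `ecc_encode` and `ecc_decode` and the 4-bit data width are hypothetical choices made for illustration only.

```python
# Illustrative sketch only: an extended Hamming (8,4) code, assumed here as
# one example of an ECC; the names and the 4-bit data width are hypothetical.

DATA_POSITIONS = (3, 5, 6, 7)   # codeword positions holding data bits
CHECK_POSITIONS = (1, 2, 4)     # codeword positions holding Hamming check bits


def ecc_encode(nibble):
    """Return an 8-bit codeword (list of 0/1) for a 4-bit data value."""
    code = [0] * 8
    for bit_index, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> bit_index) & 1
    # Check bit p covers every position whose index has bit p set.
    for p in CHECK_POSITIONS:
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    # Position 0 holds an overall parity bit over positions 1..7.
    for i in range(1, 8):
        code[0] ^= code[i]
    return code


def ecc_decode(code):
    """Check a codeword; return (status, data) where data is None if unusable."""
    code = list(code)               # work on a copy
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i           # XOR of the indices of all set bits
    overall = 0
    for bit in code:
        overall ^= bit              # parity over all 8 bits

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error.
        # The syndrome is the failing position (0 means the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    data = 0
    for bit_index, pos in enumerate(DATA_POSITIONS):
        data |= code[pos] << bit_index
    return status, data
```

In this sketch, flipping any one bit of a codeword returned by `ecc_encode` yields the status `"corrected"` together with the original data, while flipping any two bits yields `"uncorrectable"`, which matches the single-bit versus multiple-bit distinction drawn above.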
In conventional computer systems including a processor and a memory subsystem, the processor checks data requested from the memory subsystem when the data is received. Errors may best be detected at this stage. Conventional computer systems also perform ECC checking on the memory subsystem side when data is written to the memory subsystem from, for example, a cache associated with the processor. This latter type of error detection often occurs asynchronously with respect to the processor instruction execution stream. In other words, the error is detected as a result of a system "housekeeping" operation rather than as a result of the data being requested by a particular process. Because it is usually not possible in this case to identify a process that requested the data in which the error was detected, handling the error requires halting all processes active on the system, sometimes including the processor kernel. This is equivalent to a system reset. A system reset or a halt of all active processes potentially makes the system and stored data unavailable to users for the period required to handle the error. This is especially significant in a server environment, where system downtime must be minimized. For these reasons, handling errors detected during housekeeping operations, such as a write-back to main memory in which the data is not actually required for use by any process, is extremely inefficient and wasteful.