The present invention relates to computer systems and, more particularly, to the detection of errors in data stored in a memory region and the handling of those errors by a processor.
Computer systems, from handheld electronic devices to medium-sized mobile and desktop systems to large servers and workstations, are becoming increasingly pervasive in our society. Each computer system includes one or more processors. A processor manipulates and controls the flow of data in a computer. Improving processor reliability and data integrity tends to improve the overall quality of the computer. Processor designers employ many different techniques to achieve these goals to create more robust computers for consumers.
One reliability problem arises from occurrences known as soft errors. A soft error is a situation in which a bit is set to a particular value in the processor, and the bit spontaneously changes to the opposite value, thereby making the associated data erroneous. A soft error may be caused by cosmic rays passing through a storage element within the processor. These rays may charge or discharge the storage element, causing a stored bit to change its value.
As processor supply voltages continue to be reduced in an effort to reduce device geometry to increase speed and packing density, the difference in voltage values that define the 1's and 0's of bits is reduced as well. This makes processors more susceptible to soft errors. In addition, as storage elements become more densely packed within processors, the likelihood of a soft error increases.
One way to combat soft errors is through the use of error correction code (ECC). ECC detects errors in data, and in some cases is able to correct those errors. For example, one type of ECC is capable of correcting single-bit errors, but can only detect, and cannot correct, double-bit errors.
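By way of illustration only, the single-bit-correcting, double-bit-detecting behavior described above may be sketched with an extended Hamming code over four data bits. The source does not identify any particular code, so the choice of this code, the bit layout, and the function names encode and decode are assumptions made for the example.

```python
# Illustrative sketch (assumed code choice): extended Hamming (8,4) code.
# Index 0 holds an overall parity bit; indices 1, 2, 4 hold Hamming parity
# bits; indices 3, 5, 6, 7 hold the four data bits.

def encode(nibble):
    """Encode a 4-bit value into an 8-bit codeword (list of 0/1 bits)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Each parity bit covers every position whose index has bit p set.
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code


def decode(code):
    """Return (status, nibble); nibble is None when uncorrectable."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i            # XOR of the indices of all set bits
    parity_error = sum(code) % 2     # 1 when an odd number of bits flipped
    if syndrome == 0 and parity_error == 0:
        status = "ok"
    elif parity_error == 1:
        code[syndrome] ^= 1          # single-bit error: syndrome is its index
        status = "corrected"
    else:
        return "uncorrectable", None  # double-bit error: detect only
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble
```

In this sketch, flipping any one bit of a codeword yields the status "corrected" together with the original value, whereas flipping any two bits yields "uncorrectable", which corresponds to the case in which the system must decide how to handle the erroneous data.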
Because ECC is limited in its ability to correct multi-bit errors, ECC typically relies on the computer system software or hardware to take the necessary precautions when the ECC detects an uncorrectable, multi-bit error in data. For example, upon detecting uncorrectable, erroneous data, the ECC causes a system-wide reset which terminates all processes and shuts down the system. This is done to prevent the uncorrectable, erroneous data from corrupting the rest of the system.
In accordance with an embodiment of the present invention, an uncorrectable error is detected in the data of a computer system. The erroneous data is allowed to be stored in first and second caches of the computer system while the system runs first and second processes, the first process being associated with the data. The first process is terminated when an attempt is made to load the data from the cache. Meanwhile, the second process continues to run.
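The containment behavior summarized above may be modeled, for illustration only, by the following minimal software sketch. It assumes that erroneous data is marked with a per-line flag that travels with the data between caches. The source does not state how the data is marked, so that flag, the class names Cache and PoisonedDataError, and the function run_processes are all hypothetical.

```python
# Illustrative model (assumed mechanism): a per-line "poisoned" flag marks
# data with an uncorrectable error and is preserved when the line is copied
# from one cache to another.

class PoisonedDataError(Exception):
    """Raised when a load touches a line marked as uncorrectable."""


class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> (value, poisoned flag)

    def store(self, addr, value, poisoned=False):
        # Erroneous data is allowed into the cache; only the flag records it.
        self.lines[addr] = (value, poisoned)

    def load(self, addr):
        value, poisoned = self.lines[addr]
        if poisoned:
            raise PoisonedDataError(f"{self.name}: poisoned line at {addr:#x}")
        return value


def run_processes(processes, cache):
    """Run each (name, addresses) pair; stop only a process that loads bad data."""
    outcome = {}
    for name, addresses in processes:
        try:
            for addr in addresses:
                cache.load(addr)
            outcome[name] = "completed"
        except PoisonedDataError:
            outcome[name] = "terminated"
    return outcome


second_cache = Cache("second")
second_cache.store(0x100, 0xDEAD, poisoned=True)  # uncorrectable data
second_cache.store(0x200, 42)                     # healthy data

first_cache = Cache("first")
for addr, (value, poisoned) in second_cache.lines.items():
    first_cache.store(addr, value, poisoned)      # flag travels with the line

outcome = run_processes([("first", [0x100]), ("second", [0x200])], first_cache)
```

In this model, the process that loads the flagged address is recorded as terminated while the other process completes, so no system-wide reset is needed to keep the erroneous data from being consumed.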
Other features and advantages of the present invention will be apparent from the accompanying figures and the detailed description that follows.