1. Field of the Invention
The present invention relates to the field of computer systems and more specifically to preventing data corruption due to soft errors.
2. Description of the Related Art
Soft errors in data storage elements, such as memory cells and pipeline latches, occur when incident radiation charges or discharges the storage element, thereby changing its binary state. Soft errors are an increasing concern with smaller scale fabrication processes: as the size (hence, the capacitance) of the storage elements shrinks, incident radiation has a greater effect in causing soft errors in such elements. Previously, soft errors were statistically significant only for large and dense storage structures, such as cache memories. However, structures with smaller features are now more prone to soft errors. Structures such as pipeline latches (particularly, wide multi-bit datapath latches) are affected to a greater extent, and the probability of a soft error occurring in them is now much more significant.
A problem with soft errors is that they tend to silently corrupt data in a program. Because the system is unable to detect the corrupted data, the program continues to execute and generates an incorrect result. In some instances, the system may not be able to detect the corrupted result either. For example, the encodings of many instructions differ in only a single bit value. A compare-equal instruction and a compare-not-equal instruction may differ by one bit value. The probability of data corruption is much greater in this instance. Accordingly, this type of silent data corruption (SDC) is not desirable in critical applications, such as commercial transaction server applications, where wrong results can have broad-reaching implications.
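By way of illustration only, the single-bit hazard described above can be sketched in software. The 8-bit opcode values, the names CMP_EQ and CMP_NE, and the helper functions below are hypothetical and are not taken from any particular instruction set; the sketch merely assumes two compare encodings that differ in their least significant bit.

```python
# Hypothetical 8-bit opcodes chosen only for illustration (not a real ISA).
# They differ in exactly one bit position (bit 0).
CMP_EQ = 0b01100000
CMP_NE = 0b01100001


def flip_bit(word: int, bit: int) -> int:
    """Model a soft error by inverting one bit of a latched word."""
    return word ^ (1 << bit)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def execute(opcode: int, x: int, y: int) -> bool:
    """Evaluate the toy compare instruction selected by the opcode."""
    if opcode == CMP_EQ:
        return x == y
    if opcode == CMP_NE:
        return x != y
    raise ValueError("illegal opcode")


# A single upset in bit 0 of the instruction latch turns one valid
# instruction into another valid instruction, so nothing traps.
corrupted = flip_bit(CMP_EQ, 0)
intended_result = execute(CMP_EQ, 3, 3)
actual_result = execute(corrupted, 3, 3)
```

In this sketch the corrupted word is still a legal opcode, so the toy machine raises no fault and simply returns the opposite comparison result, which is the silent behavior described above.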
In a high frequency, highly-pipelined machine, the pipeline latches are susceptible to such soft errors. For example, instruction data latches make up a significant portion of the total number of latches on a chip that are susceptible to soft errors. Protecting these latches can significantly reduce the total undetected error count of a high performance processor. Thus, it is desirable to optimize the machine design to reduce, or even prevent, incorrect results due to soft errors. Since it is virtually impossible to stop all soft errors from occurring, at the very minimum it is desirable that soft errors be detected when they occur, so that the application can at least be terminated and any data corruption reported. A preferable option is to detect the error when it occurs and seamlessly continue execution of the application after correcting the error.
Past practices have used parity checking and error correcting code techniques to identify the occurrence of soft errors. Past practices have also used a hardware solution, such as functional redundancy, to check for soft errors. However, this latter approach typically requires significant additional on-chip circuitry to provide the error detection. None of these practices has addressed soft error detection and seamless correction, when such soft errors are detected, to the extent provided by the practice of the present invention.
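For reference, the parity checking mentioned above can be sketched as follows. This is a minimal software model under stated assumptions, not a description of any particular prior circuit: the function names, the tuple representation of a latch, and the choice of even parity are all hypothetical.

```python
def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0.

    Storing this bit alongside the word makes the total number of
    set bits even (even parity).
    """
    return bin(word).count("1") & 1


def latch(word: int) -> tuple[int, int]:
    """Model writing a word into a latch together with its parity bit."""
    return (word, parity_bit(word))


def is_consistent(stored: tuple[int, int]) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    word, stored_parity = stored
    return parity_bit(word) == stored_parity


stored = latch(0b10110010)

# One flipped bit changes the parity, so the error is detected.
single_upset = (stored[0] ^ (1 << 3), stored[1])

# Two flipped bits cancel out in the parity, so the error goes unnoticed.
double_upset = (stored[0] ^ 0b11, stored[1])
```

A single parity bit of this kind detects any odd number of flipped bits but cannot locate or correct them, and an even number of flips passes the check. Error correcting codes add further check bits so that an erroneous bit can also be located and corrected.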