1. Field of the Invention
The present invention relates to techniques for reducing the frequency of soft errors in an integrated circuit.
2. Related Art
Cosmic rays or alpha particles that strike a silicon-based device, such as a microprocessor, can cause an arbitrary node within the device to change state in unpredictable ways, thereby inducing what is referred to as a “soft error.” Microprocessors and other silicon-based devices are becoming increasingly susceptible to soft errors as such devices decrease in size. Soft errors are transient in nature and may or may not cause the device to malfunction if left undetected and/or uncorrected. An uncorrected and undetected soft error may, for example, cause a memory location to contain an incorrect value which may in turn cause the microprocessor to execute an incorrect instruction or to act upon incorrect data.
A single uncorrected and undetected soft error can cause a computer system to crash or otherwise become unusable until the error is corrected or the system is reset and brought back to a normal operating condition. Such error recovery may require several hours or even several days, depending on the nature of the error and the size and complexity of the computer system. Each hour of downtime may cause the computer system owner or operator to incur costs in the thousands, or even millions, of dollars depending on the size of the enterprise serviced by the computer system. Furthermore, service providers that guarantee minimum levels of service for covered computer systems (such as a guaranteed maximum number of hours to recover from any error) are also put at risk by soft errors that are difficult to detect and correct quickly. There is, therefore, a significant need to prevent the occurrence of soft errors and to detect and correct such errors quickly when they occur.
One response to soft errors has been to add hardware to microprocessors to detect and correct such errors. Various techniques have been employed to perform such detection and correction, such as adding parity-checking capabilities to processor caches. Such techniques, however, are best at detecting and correcting soft errors in memory arrays, and are not as well-suited for detecting and correcting soft errors in arbitrary control logic, execution datapaths, or latches within a microprocessor. In addition, adding circuitry for implementing such techniques can add significantly to the size and cost of manufacturing the microprocessor.
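By way of illustration only, the parity-checking approach mentioned above may be sketched in software as follows. The sketch assumes a single even-parity bit stored alongside each word; the names parity_bit, store, check, and flip_bit are hypothetical and are not drawn from any particular processor design. A single parity bit of this kind can reveal that one bit has flipped but cannot by itself identify or repair the flipped bit, and an even number of flipped bits escapes detection.

```python
from typing import Tuple

# Hypothetical model of a memory entry: (data word, stored parity bit).
Entry = Tuple[int, int]


def parity_bit(word: int) -> int:
    """Return the even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


def store(word: int) -> Entry:
    """Store a word together with the parity bit computed at write time."""
    return word, parity_bit(word)


def check(entry: Entry) -> bool:
    """Return True if the stored parity still matches the stored word."""
    word, stored_parity = entry
    return parity_bit(word) == stored_parity


def flip_bit(entry: Entry, bit: int) -> Entry:
    """Simulate a soft error by inverting one bit of the stored word."""
    word, stored_parity = entry
    return word ^ (1 << bit), stored_parity


entry = store(0b1011)
single_error = flip_bit(entry, 2)                # one flipped bit
double_error = flip_bit(flip_bit(entry, 0), 1)   # two flipped bits

print(check(entry), check(single_error), check(double_error))
```

In this sketch the single-bit error fails the check, while the double-bit error passes it, which is one reason stronger codes are sometimes used where correction is required.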
One technique that has been used to protect arbitrary control logic and associated execution datapaths is to execute the same instruction stream on two or more processors in parallel. Such processors are said to execute copies of the instruction stream “in lockstep,” and therefore are referred to as “lockstepped processors.” When the microprocessors are operating correctly (i.e., in the absence of soft errors), all of the lockstepped processors should obtain the same results because they are executing the same instruction stream. A soft error introduced in one processor, however, may cause the results produced by that processor to differ from the results produced by the other processor(s). Such systems, therefore, attempt to detect soft errors by comparing the results produced by the lockstepped processors after each instruction or set of instructions is executed in lockstep. If the results produced by any one of the processors differ from the results produced by the other processors, a fault is raised or other corrective action is taken. Because lockstepped processors execute redundant instruction streams, lockstepped systems are said to perform a “functional redundancy check.”
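By way of illustration only, the comparison performed by a pair of lockstepped processors may be sketched in software as follows. The sketch assumes that each instruction can be modeled as a function from one integer state to the next and that results are compared after every instruction; the names run_lockstep, LockstepMismatch, PROGRAM, fault_step, and fault_bit are hypothetical. Actual lockstepped systems perform the comparison in hardware.

```python
from typing import Callable, List, Optional

# Hypothetical model: an instruction maps the current state to the next state.
Instruction = Callable[[int], int]


class LockstepMismatch(Exception):
    """Raised when the two copies of the instruction stream disagree."""

    def __init__(self, step: int, a: int, b: int) -> None:
        super().__init__(f"results diverged at instruction {step}: {a} != {b}")
        self.step = step


def run_lockstep(
    program: List[Instruction],
    initial: int,
    fault_step: Optional[int] = None,
    fault_bit: int = 0,
) -> int:
    """Execute the program twice in lockstep, comparing after each instruction.

    If fault_step is given, one bit of the second copy's result is inverted
    after that instruction to simulate a soft error.
    """
    a = b = initial
    for step, instruction in enumerate(program):
        a = instruction(a)
        b = instruction(b)
        if step == fault_step:
            b ^= 1 << fault_bit  # simulated soft error in the second copy
        if a != b:
            raise LockstepMismatch(step, a, b)
    return a


PROGRAM: List[Instruction] = [
    lambda x: x + 3,
    lambda x: x * 2,
    lambda x: x - 1,
]

print(run_lockstep(PROGRAM, 5))  # fault-free run

try:
    run_lockstep(PROGRAM, 5, fault_step=1)
except LockstepMismatch as mismatch:
    print(mismatch)
```

With only two copies, a mismatch indicates that an error occurred but does not indicate which copy holds the incorrect result.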
One problem with lockstep-based techniques is that they typically incur a performance penalty of at least 50% by requiring all instructions to be executed in duplicate. Systems which employ lockstepping therefore obtain increased reliability at the expense of performance. Furthermore, lockstepped systems can be difficult and costly to implement due to the additional circuitry required.
What is needed, therefore, are improved techniques for reducing the frequency of soft errors in computing systems.