The present invention relates to computer system design and, more particularly, to techniques for decreasing the susceptibility of a computer system to soft errors.
Cosmic rays or alpha particles that strike a silicon-based device, such as a microprocessor, can cause an arbitrary node within the device to change state in unpredictable ways, thereby inducing what is referred to as a “soft error.” Microprocessors and other silicon-based devices are becoming increasingly susceptible to soft errors as such devices decrease in size. Soft errors are transient in nature and may or may not cause the device to malfunction if left undetected and/or uncorrected. An uncorrected and undetected soft error may, for example, cause a memory location to contain an incorrect value which may in turn cause the microprocessor to execute an incorrect instruction or to act upon incorrect data.
One response to soft errors has been to add hardware to microprocessors to detect soft errors and to correct them, if possible. Various techniques have been employed to perform such detection and correction, such as adding parity-checking capabilities to processor caches. Such techniques, however, are best at detecting and correcting soft errors in memory arrays, and are not as well-suited for detecting and correcting soft errors in arbitrary control logic, execution datapaths, or latches within a microprocessor. In addition, adding circuitry for implementing such techniques can add significantly to the size and cost of manufacturing the microprocessor.
One technique that has been used to protect arbitrary control logic and associated execution datapaths is to execute the same instruction stream on two or more processors in parallel. Such processors are said to execute two copies of the instruction stream “in lockstep,” and therefore are referred to as “lockstepped processors.” When the microprocessor is operating correctly (i.e., in the absence of soft errors), all of the lockstepped processors should obtain the same results because they are executing the same instruction stream. A soft error introduced in one processor, however, may cause the results produced by that processor to differ from the results produced by the other processor(s). Such systems, therefore, attempt to detect soft errors by comparing the results produced by the lockstepped processors after each instruction or set of instructions is executed in lockstep. If the results produced by any one of the processors differs from the results produced by the other processors, a fault is raised or other corrective action is taken. Because lockstepped processors execute redundant instruction streams, lockstepped systems are said to perform a “functional redundancy check.”
One difficulty in the implementation of lockstepping is that it can be difficult to provide clock signals which are precisely in phase with each other and which share exactly the same frequency to a plurality of microprocessors. As a result, lockstepped processors can fall out of lockstep due to timing differences even if they are otherwise functioning correctly. In higher-performance designs which use asynchronous interfaces, keeping two different processors in two different sockets on the same clock cycle can be even more difficult.
Early processors, like many existing processors, included only a single processor core. A “multi-core” processor, in contrast, may include one or more processor cores on a single chip. A multi-core processor behaves as if it were multiple processors. Each of the multiple processor cores may essentially operate independently, while sharing certain common resources, such as a cache or system interface. Multi-core processors therefore provide additional opportunities for increased processing efficiency. In some existing systems, multiple cores within a single microprocessor may operate in lockstep with each other.
Although operating multiple processors or processor cores in lockstep increases the computer system's reliability by eliminating or mitigating the effects of soft errors, such increased reliability typically comes at the price of decreased performance. Because a pair of processors operating in lockstep with each other can only execute a single (duplicated) instruction stream, such a pair of lockstepped processors provides at most 50% of the throughput of a pair of non-lockstepped processors which process two distinct instruction streams in parallel.
What is needed, therefore, are techniques for providing the increased reliability afforded by lockstepped instruction execution without incurring the performance penalty typically incurred by systems which implement lockstepped instruction execution.