The present invention relates to microprocessor design and, more particularly, to techniques for signaling errors in computer systems which implement lockstepping.
Cosmic rays or alpha particles that strike a silicon-based device, such as a microprocessor, can cause an arbitrary node within the device to change state in unpredictable ways, thereby inducing what is referred to as a “soft error.” Microprocessors and other silicon-based devices are becoming increasingly susceptible to soft errors as such devices decrease in size. Soft errors are transient in nature and may or may not cause the device to malfunction if left undetected and/or uncorrected. An uncorrected and undetected soft error may, for example, cause a memory location to contain an incorrect value which may in turn cause the microprocessor to execute an incorrect instruction or to act upon incorrect data.
One response to soft errors has been to add hardware to microprocessors to detect soft errors and to correct them, if possible. Various techniques have been employed to perform such detection and correction, such as adding parity-checking capabilities to processor caches. Such techniques, however, are best at detecting and correcting soft errors in memory arrays, and are not as well-suited for detecting and correcting soft errors in arbitrary control logic, execution datapaths, or latches within a microprocessor. In addition, adding circuitry for implementing such techniques can add significantly to the size and cost of manufacturing the microprocessor.
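By way of illustration only, the parity-checking technique mentioned above can be sketched in a few lines of Python. The function names (`parity_bit`, `store`, `check`) and the 8-bit example word are hypothetical and are not drawn from any particular processor design. The sketch assumes a single even-parity bit per stored word, which can detect a single flipped bit but cannot correct it.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a non-negative integer word."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Model a cache entry as (data word, parity bit computed at write time)."""
    return word, parity_bit(word)


def check(entry: tuple[int, int]) -> bool:
    """Recompute parity on read; False means an error was detected."""
    word, stored_parity = entry
    return parity_bit(word) == stored_parity


entry = store(0b10110010)

# A soft error is modeled as a single bit flip in the stored data word.
single_flip = (entry[0] ^ (1 << 3), entry[1])

# Two flipped bits cancel out in the parity sum and go undetected.
double_flip = (entry[0] ^ 0b11, entry[1])
```

In this sketch `check(entry)` succeeds, `check(single_flip)` fails, and `check(double_flip)` succeeds even though the data is corrupted, which illustrates why a lone parity bit is a limited form of protection.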
One technique that has been used to protect arbitrary control logic and associated execution datapaths is to execute the same instruction stream on two or more processors in parallel. Such processors are said to execute copies of the instruction stream “in lockstep,” and therefore are referred to as “lockstepped processors.” When the processors are operating correctly (i.e., in the absence of soft errors), all of the lockstepped processors should obtain the same results because they are executing the same instruction stream. A soft error introduced in one processor, however, may cause the results produced by that processor to differ from the results produced by the other processor(s). Such systems, therefore, attempt to detect soft errors by comparing the results produced by the lockstepped processors after each instruction or set of instructions is executed in lockstep. If the results produced by any one of the processors differ from the results produced by the other processors, a fault is raised or other corrective action is taken. Because lockstepped processors execute redundant instruction streams, lockstepped systems are said to perform a “functional redundancy check.”
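The comparison step described above may be illustrated by the following minimal Python sketch. It is a software model only, not a description of any actual lockstep circuitry. The `Core` class, the two-operation instruction set, the `LockstepError` exception, and the `fault` parameter used to inject a bit flip are all assumptions introduced for illustration.

```python
class LockstepError(Exception):
    """Raised when lockstepped cores produce different results."""


class Core:
    """Hypothetical core with a single accumulator register."""

    def __init__(self) -> None:
        self.acc = 0

    def execute(self, instr: tuple[str, int]) -> int:
        op, operand = instr
        if op == "add":
            self.acc += operand
        elif op == "mul":
            self.acc *= operand
        else:
            raise ValueError(f"unknown op: {op}")
        return self.acc


def run_lockstep(cores, program, fault=None):
    """Run the same program on every core, comparing results after each step.

    fault is an optional (step, core_index, bit) tuple that flips one bit of
    one core's accumulator after that step, modeling a soft error.
    """
    for step, instr in enumerate(program):
        results = [core.execute(instr) for core in cores]
        if fault is not None and fault[0] == step:
            _, index, bit = fault
            cores[index].acc ^= 1 << bit
            results[index] = cores[index].acc
        if len(set(results)) != 1:
            raise LockstepError(f"mismatch at step {step}: {results}")
    return cores[0].acc


PROGRAM = [("add", 3), ("mul", 4), ("add", 1)]
clean_result = run_lockstep([Core(), Core()], PROGRAM)

try:
    run_lockstep([Core(), Core()], PROGRAM, fault=(1, 1, 0))
    fault_detected = False
except LockstepError:
    fault_detected = True
```

With no injected fault, both cores agree at every step and the run completes. With a bit flipped in the second core after step 1, the per-step comparison raises `LockstepError`. With only two cores the check can report that a mismatch occurred but cannot by itself determine which core is correct.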
One difficulty in implementing lockstepping is providing a plurality of microprocessors with clock signals which are precisely in phase with each other and which share exactly the same frequency. As a result, lockstepped processors can fall out of lockstep due to timing differences even if they are otherwise functioning correctly. In higher-performance designs which use asynchronous interfaces, keeping two different processors in two different sockets on the same clock cycle can be even more difficult.
Early processors, like many existing processors, included only a single processor core. A “multi-core” processor, in contrast, includes two or more processor cores on a single chip. A multi-core processor behaves as if it were multiple processors. Each of the multiple processor cores may essentially operate independently, while sharing certain common resources, such as a cache or system interface. Multi-core processors therefore provide additional opportunities for increased processing efficiency. In some existing systems, multiple cores within a single microprocessor may operate in lockstep with each other.
In existing systems for enabling multiple microprocessor cores to operate in lockstep, the microprocessor typically connects to a single system bus, a portion of which is shared by two or more lockstepped cores in the microprocessor. Because only one core can access the shared portion of the bus at a time in such systems, such systems typically include circuitry for arbitrating between the multiple cores and for multiplexing the data from the current “bus master” core onto the system bus. In such designs, the lockstep circuitry is typically implemented at these points of arbitration and multiplexing. Implementing lockstep circuitry in this way can be very difficult, particularly because the requirements of the bus architecture and protocol may leave very little time to perform lockstep checking. Furthermore, in such systems all data from the bus is duplicated before being transmitted to the lockstepped cores.
When a lockstep error is detected in a pair of lockstepped processor cores, it is desirable to notify the other processor cores in the system that such an error has been detected so that the other cores may disregard the output produced by the malfunctioning core or take other appropriate action. In a system in which all cores are coupled to a shared system bus, an error signal may be broadcast over the bus to all cores when a lockstep error is detected. In a link-based system, however, processor cores are connected in pairs over point-to-point links and there typically are no shared signals. As a result, it typically is not possible in such a system to use a single pin to broadcast an error message to all cores to notify them that a lockstep error has been detected. Instead, the component which identifies an error must signal the error over each of its point-to-point links, and each recipient of the error must then signal the error over each of its own point-to-point links. Such error signaling can be inefficient and difficult to implement.
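The cost of link-by-link error signaling relative to a shared-bus broadcast may be illustrated by the following Python sketch. The four-node ring topology, the node names, and the `flood_error` function are hypothetical. The sketch also assumes that a component forwards the error only the first time it receives it, a detail the description above does not specify.

```python
from collections import deque


def flood_error(links: dict[str, list[str]], origin: str) -> tuple[set[str], int]:
    """Propagate an error from origin over point-to-point links.

    links maps each component to the components it is directly linked to.
    Returns the set of notified components and the number of messages sent.
    """
    notified = {origin}
    messages = 0
    pending = deque([origin])
    while pending:
        node = pending.popleft()
        for peer in links[node]:
            messages += 1  # one message per link of each forwarding component
            if peer not in notified:
                notified.add(peer)
                pending.append(peer)  # peer forwards once, on first receipt
    return notified, messages


# Hypothetical ring of four components, each with two point-to-point links.
RING = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "A"],
}

notified, messages = flood_error(RING, "A")
```

In this ring all four components are notified, but eight messages cross the links, whereas a shared bus would need a single broadcast. The message count in this model equals the sum of the link counts of all reachable components, so it grows with the connectivity of the system.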
What is needed, therefore, are improved techniques for signaling errors in a computer system which implements lockstepping.