1. Field of the Invention
The present invention relates to the design of multiprocessor systems. More specifically, the present invention relates to a method and an apparatus for facilitating fault-tolerance by comparing addresses and data from redundant processors running in lock-step.
2. Related Art
As microprocessor systems become faster and more complex, larger numbers of circuit elements are being driven at ever higher clock rates. This increases the likelihood that transient errors will occur during program execution, and thereby reduces the reliability of microprocessor systems.
Error-correcting codes can be employed to correct transient errors that occur when data is stored into memory. However, such error-correcting codes cannot correct all types of errors, and furthermore, the associated circuitry to detect and correct errors is impractical to deploy in extremely time-critical computational circuitry within a microprocessor.
Transient errors can also be detected and/or corrected by replicating a computer system so that there exist two or more copies of the computer system concurrently executing the same code. This allows transient errors to be detected by periodically comparing results produced by these replicated computer systems.
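By way of illustration only, the periodic comparison described above can be sketched as follows. The sketch assumes that each replicated system records a trace of (address, data) pairs; the function name first_divergence and the trace representation are hypothetical and are not taken from any particular system.

```python
def first_divergence(trace_a, trace_b):
    """Return the index of the first step at which two replicas disagree.

    Each trace is a list of (address, data) pairs, an assumed format.
    Returns None when the traces are identical.
    """
    for step, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return step
    # One replica produced more results than the other.
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None


# Two replicas executing the same code; the second suffers a
# transient error in the data of its second result.
replica_1 = [(0x100, 7), (0x104, 9), (0x108, 11)]
replica_2 = [(0x100, 7), (0x104, 8), (0x108, 11)]
mismatch_step = first_divergence(replica_1, replica_2)
```

Note that with only two replicas a mismatch reveals that an error occurred, but not which replica is in error.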
Transient errors can be corrected in a replicated computer system by voting. If there are three or more replicated computer systems and an error is detected, the computer systems can vote to determine which result is correct. For example, in a three-computer system, if two of the three computers produce the same result and the third produces a different result, the result produced by the two agreeing computers is presumed to be correct.
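The voting scheme can be sketched, under simplifying assumptions, as a strict-majority vote over the results reported by the replicas. The function name majority_vote is hypothetical, and returning None when no strict majority exists is a design choice of this sketch rather than a requirement of any particular system.

```python
from collections import Counter


def majority_vote(results):
    """Return the result reported by a strict majority of replicas.

    Returns None when no result is reported by more than half of the
    replicas, in which case the error cannot be corrected by voting.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    return None


# Three replicas; the third produces a different result.
presumed_correct = majority_vote([42, 42, 17])
```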
However, replicating entire computer systems can be expensive, especially if the entire system memory has to be replicated.
What is needed is a method and an apparatus for providing fault-tolerance without replicating entire computer systems.
Another problem with using replicated (redundant) computer systems to provide fault-tolerance is that existing cache-coherence mechanisms can interfere with the task of keeping all of the replicated processors in the same state.
For example, a common multiprocessor design includes a number of processors 151-154 with level one (L1) caches 161-164 that share a single level two (L2) cache 180 and a memory 183 (see FIG. 1). During operation, if processor 151 accesses a data item that is not present in its local L1 cache 161, the system attempts to retrieve the data item from L2 cache 180. If the data item is not present in L2 cache 180, the system first retrieves the data item from memory 183 into L2 cache 180, and then from L2 cache 180 into L1 cache 161.
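The lookup sequence described above can be modeled with the following simplified sketch. It assumes caches of unlimited capacity that are indexed directly by address, and it ignores eviction and coherence; the class name CacheHierarchy and its method names are hypothetical.

```python
class CacheHierarchy:
    """Toy model of per-processor L1 caches sharing one L2 cache and a memory."""

    def __init__(self, memory, num_processors):
        self.memory = dict(memory)  # backing memory: address -> data
        self.l2 = {}                # shared L2 cache
        self.l1 = [dict() for _ in range(num_processors)]  # one L1 per processor

    def read(self, cpu, addr):
        """Return (data, level that supplied the data) for a read by processor cpu."""
        l1 = self.l1[cpu]
        if addr in l1:
            return l1[addr], "L1"
        if addr in self.l2:
            source = "L2"
        else:
            # Miss in L2: first retrieve the item from memory into L2.
            self.l2[addr] = self.memory[addr]
            source = "memory"
        # Then retrieve the item from L2 into the local L1.
        l1[addr] = self.l2[addr]
        return l1[addr], source


hierarchy = CacheHierarchy({0x40: 5}, num_processors=4)
first_read = hierarchy.read(0, 0x40)   # misses in L1 and L2
second_read = hierarchy.read(0, 0x40)  # hits in the local L1
other_read = hierarchy.read(1, 0x40)   # misses in its own L1, hits in L2
```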
Note that coherence problems can arise if a copy of the same data item exists in more than one L1 cache. In this case, modifications to a first version of a data item in L1 cache 161 may cause the first version to differ from a second version of the data item in L1 cache 162.
In order to prevent such coherence problems, these computer systems typically provide a coherence protocol that operates across bus 170. A coherence protocol typically ensures that if one copy of a data item is modified in L1 cache 161, other copies of the same data item in L1 caches 162-164, in L2 cache 180 and in memory 183 are updated or invalidated to reflect the modification. This is accomplished by broadcasting an invalidation message across bus 170.
However, note that this type of coherence mechanism can cause replicated processors to have different states in their local L1 caches. For example, if a first replicated processor updates a data item in its L1 cache, it may cause the same data item to be invalidated in the L1 cache of a second replicated processor. In this case, the L1 cache of the first replicated processor ends up in a different state than the L1 cache of the second replicated processor.
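The divergence described above can be illustrated with the following sketch, which assumes a simple write-invalidate policy in which a write by one processor removes the data item from every other L1 cache. The function name write_with_invalidate and the dictionary representation of the caches are hypothetical.

```python
def write_with_invalidate(l1_caches, writer, addr, value):
    """Write value at addr in the writer's L1 cache and invalidate other copies.

    l1_caches maps a processor name to its L1 cache (address -> data).
    """
    for name, cache in l1_caches.items():
        if name != writer:
            cache.pop(addr, None)  # models the broadcast invalidation
    l1_caches[writer][addr] = value


# Two replicated processors start with identical L1 contents.
l1_caches = {"first": {0x40: 1}, "second": {0x40: 1}}
identical_before = l1_caches["first"] == l1_caches["second"]

# The first replicated processor updates the data item.
write_with_invalidate(l1_caches, "first", 0x40, 2)
identical_after = l1_caches["first"] == l1_caches["second"]
```

In this model the two L1 caches are identical before the write and differ afterwards, because the second cache no longer holds the data item.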
What is needed is a method and an apparatus for providing fault-tolerance through replicated processors, without the side-effects caused by a cache-coherence mechanism.