1. Field of the Invention
This invention relates to the field of self-repair of microprocessor array structures and, more particularly, to the field of masking hard faults in microprocessor array structures.
2. Description of the Related Art
In computer hardware, “hard faults” are not uncommon. Hard faults are distinguished from “soft,” or transient, faults by their permanence. A hard fault is an error condition that remains fixed; an example is a location on a hard drive that stores a digital “1” regardless of attempts to store something else (e.g., a digital “0”) to the location. Unlike a soft fault, which is transient and can be reset, a hard fault cannot be changed. As a result, hard faults are particularly troublesome to both software and hardware designers.
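The stuck-at behavior described above can be illustrated with a minimal sketch. The class name `StuckAtCell` and its interface are hypothetical and are introduced only for illustration; they do not describe any particular storage device.

```python
class StuckAtCell:
    """One-bit storage cell with an optional permanent (hard) stuck-at fault.

    Hypothetical model for illustration only.
    """

    def __init__(self, stuck_at=None):
        # None means the cell is healthy; 0 or 1 means it is stuck at that value.
        self.stuck_at = stuck_at
        self._value = 0

    def write(self, bit):
        # The write is accepted, but a stuck cell will not reflect it on read.
        self._value = bit & 1

    def read(self):
        if self.stuck_at is not None:
            return self.stuck_at  # hard fault: same value regardless of writes
        return self._value


healthy = StuckAtCell()
faulty = StuckAtCell(stuck_at=1)
for cell in (healthy, faulty):
    cell.write(0)
```

After the write of “0”, the healthy cell reads back 0, while the faulty cell continues to read back 1 no matter how many times the write is repeated.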
As microprocessor fabrication technology continues to shrink devices and wires and increase clock frequencies, hard fault rates are consequently increasing. One reason for the increase in hard faults is the increased probability of short and open circuits due to reduced circuit sizes. These reduced circuit dimensions result in devices with increased sensitivity to effects such as electromigration and gate oxide breakdown, both sources of hard faults in a device.
There are several existing techniques for comprehensively tolerating hard faults in microprocessor cores. The simplest approach is forward error recovery (FER) using redundant microprocessors operating in parallel, e.g., “pair and spare” or triple modular redundancy (TMR). Where extreme reliability is required, this solution is effective but not cost-efficient. IBM mainframes and certain systems built by Tandem and Stratus are examples of systems that use redundant processors to mask hard faults. Mainframes also replicate certain structures within the microprocessors themselves to increase reliability. The drawback of these schemes is the large added cost and power consumption of the redundant hardware. For non-mission-critical applications, this solution is not preferred.
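As a minimal sketch of how TMR masks a fault, the following bitwise majority voter combines the results of three redundant units. The function name `tmr_vote` and the sample values are assumptions made for illustration; real systems perform this voting in hardware.

```python
def tmr_vote(a, b, c):
    """Return the bitwise majority of three redundant integer results.

    Each output bit takes the value held by at least two of the three inputs,
    so a fault confined to one unit is masked.
    """
    return (a & b) | (a & c) | (b & c)


# Hypothetical example: the third unit has a hard fault that forces bit 2 to 1.
good_result = 0b1010
faulty_result = good_result | 0b0100
voted = tmr_vote(good_result, good_result, faulty_result)
```

The voter masks the fault without any recovery step, which is why the approach is classified as forward error recovery, but it does so at the cost of tripling the hardware.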
Cost-effective approaches for comprehensively tolerating hard faults exist and can be far less expensive than the above-described redundancy approaches, but they often sacrifice performance in the presence of hard faults. One such approach is back-end or commit-stage error detection with backward error recovery (BER), which uses end-of-pipeline checker processors to perform the detection and trigger recovery operations. Dynamic Implementation Verification Architecture (DIVA) is one example of this approach and is used to provide fault protection for traditional microprocessor core implementations. The processor cores used in these traditional microprocessors must be fast and aggressive to perform the complex operations that they are tasked to perform. DIVA and other similar systems use a small, simple, in-order, on-chip checker processor to protect the microprocessor from both hard and soft faults. The checker processor sits at the commit stage of the microprocessor and compares the result of its own execution of each instruction to the result of execution by the microprocessor. If the results differ, the checker processor is assumed to be correct and its result is used. This assumption is based on the provably correct design of the checker processor and its relatively small size and complexity with respect to the more aggressive microprocessor. To prevent the fault in the microprocessor from propagating to later instructions, DIVA then flushes the aggressive processor's pipeline, which effectively backs processing up, on the order of a few tens of instructions, to make certain that any in-core forwarding of the faulty value is nullified and replayed with the correct value from the checker. On the replay, the correct value will not need to be forwarded in the microprocessor core because it will already be available in the register file and will be fetched from there.
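The commit-stage comparison described above can be sketched in software as follows. This is an illustrative model only and is not DIVA's actual implementation; the three-operation instruction set and the names `checker_execute`, `faulty_core_execute`, and `commit` are hypothetical.

```python
import operator

# Hypothetical three-operation instruction set used only for this sketch.
OPS = {"add": operator.add, "sub": operator.sub, "and": operator.and_}


def checker_execute(instr):
    """Simple checker: re-executes one instruction and is assumed correct."""
    op, a, b = instr
    return OPS[op](a, b)


def faulty_core_execute(instr, stuck_bit=0):
    """Model of an aggressive core whose result has one bit stuck at 1."""
    return checker_execute(instr) | (1 << stuck_bit)


def commit(instr, core_result):
    """Compare results at commit; return (value_to_commit, flush_needed)."""
    checked = checker_execute(instr)
    if checked != core_result:
        # Mismatch: commit the checker's value and flush the core's pipeline.
        return checked, True
    return core_result, False


exercised = commit(("add", 2, 2), faulty_core_execute(("add", 2, 2)))
masked = commit(("add", 2, 3), faulty_core_execute(("add", 2, 3)))
```

In this sketch the first instruction exercises the hard fault (the correct result, 4, has bit 0 clear), so the checker's value is committed and a flush is requested. The second instruction does not exercise the fault (the correct result, 5, already has bit 0 set), so it commits normally. A hard fault therefore incurs the recovery penalty only on the instructions that exercise it.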
The fault-free performance of DIVA and other checker processor systems can be made virtually equal to that of the aggressive processor, since the simple checker processor can leverage the faster microprocessor as a pre-fetch engine. The small amount of redundancy of a checker processor system such as DIVA is far less expensive and power-hungry than TMR. However, such systems incur a performance penalty for each detected error. Every time a hard fault manifests itself as an error, the performance of the system temporarily degenerates to that of the checker processor until the microprocessor refills its pipeline. Because the checker processor is very slow, performance degrades appreciably for error rates greater than one per thousand instructions. In the presence of hard faults that are exercised frequently, performance suffers significantly.
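The sensitivity to error rate can be illustrated with a simple first-order model in which each detected error adds a fixed number of recovery cycles. The model and every numeric value below are assumptions chosen for illustration; they are not measurements of DIVA or of any other system.

```python
def effective_cpi(base_cpi, errors_per_instr, recovery_cycles):
    """Average cycles per instruction when each error adds a fixed penalty."""
    return base_cpi + errors_per_instr * recovery_cycles


def slowdown(base_cpi, errors_per_instr, recovery_cycles):
    """Ratio of execution time with errors to fault-free execution time."""
    return effective_cpi(base_cpi, errors_per_instr, recovery_cycles) / base_cpi


# Assumed values: fault-free CPI of 0.5 and 50 cycles lost per detected error.
BASE_CPI = 0.5
RECOVERY_CYCLES = 50

for rate in (0.0, 0.0001, 0.001, 0.01):
    factor = slowdown(BASE_CPI, rate, RECOVERY_CYCLES)
    print(f"errors/instr={rate:<7} slowdown={factor:.2f}x")
```

Under these assumed parameters the penalty grows linearly with the error rate, so a fault that is exercised ten times as often costs ten times as many recovery cycles.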
Cost-effective approaches for tolerating only specific classes of hard faults also exist. One approach is the use of error correcting codes (ECC). ECC can tolerate up to a targeted number of faulty bits in a piece of data, and it is a useful technique for protecting SRAM, DRAM, buses, etc., from this fault model. However, ECC cannot tolerate more than a certain number of faulty bits, nor can it be implemented quickly enough to be a viable solution for many performance-critical structures in a microprocessor.
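Both the capability and the limit of ECC can be illustrated with a Hamming(7,4) code, which corrects any single faulty bit in a seven-bit codeword. This code is chosen here only as a familiar example; the text above does not specify a particular code, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the bad bit, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
word = hamming74_encode(data)

one_fault = list(word)
one_fault[4] ^= 1  # a single faulty bit is corrected

two_faults = list(word)
two_faults[0] ^= 1
two_faults[1] ^= 1  # two faulty bits exceed what this code can correct
```

Decoding `one_fault` recovers the original data, whereas decoding `two_faults` returns incorrect data without any indication of failure, which illustrates why ECC protects only against the number of faulty bits for which the code was designed.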