1. Field of the Invention
The invention relates to error detection and, in particular, residue-based error detection.
2. Description of the Related Art
With computers being part of everyday life and critical for business, microprocessor reliability is a critical design requirement. This reliability, usually expressed as MTBF (Mean Time Between Failures), reflects both the error rate of the microprocessor and the capability of the microprocessor to survive many of these errors. Processor errors can be classified into two categories: 1) soft or transient errors and 2) hard or permanent errors. The error rate is reported in FITs (failures in time), where one FIT specifies one failure in a billion hours of operation. As a frame of reference, a FIT rate of 114,000 FIT for a component (a microprocessor, for example) indicates that the component has an MTBF of one year. However, if a company sells 1000 microprocessors to a customer, each microprocessor should have a FIT rate of about 114 FIT in order for that customer to experience an average of one microprocessor failure per year for this lot.
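The FIT arithmetic above can be checked with a short sketch. The function names are illustrative only, and a year is assumed to be 8,760 hours (leap years ignored).

```python
# Illustrative sketch of the FIT/MTBF arithmetic; all names are hypothetical.
HOURS_PER_YEAR = 24 * 365        # assumption: 8,760 hours, leap years ignored
FIT_HOURS = 1_000_000_000        # one FIT = one failure per 10^9 device-hours


def fit_to_mtbf_years(fit: float) -> float:
    """MTBF in years of a single component with the given FIT rate."""
    return FIT_HOURS / fit / HOURS_PER_YEAR


def mtbf_years_to_fit(years: float) -> float:
    """FIT rate that corresponds to the given MTBF in years."""
    return FIT_HOURS / (years * HOURS_PER_YEAR)


def per_unit_fit_budget(units: int, lot_failures_per_year: float = 1.0) -> float:
    """FIT rate each unit may have so the whole lot averages the given failures per year."""
    lot_fit = lot_failures_per_year * FIT_HOURS / HOURS_PER_YEAR
    return lot_fit / units


if __name__ == "__main__":
    print(f"MTBF at 114,000 FIT: {fit_to_mtbf_years(114_000):.3f} years")
    print(f"Per-unit budget for a lot of 1000: {per_unit_fit_budget(1000):.1f} FIT")
```

Under these assumptions, an MTBF of one year corresponds to roughly 114,000 FIT, and spreading that budget over 1000 units gives roughly 114 FIT per unit, matching the figures in the text.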
High energy particles from cosmic radiation or alpha particles in packaging materials cause soft errors affecting electronic components. Such radiation events can cause charge collection at sensitive circuit nodes, corrupting those circuit states without causing permanent damage. The radiation events primarily affect storage elements (RAM cells, latches, flip-flops) which hold a state (bit values) for a relatively long time. Radiation events and altitude affect soft error rates of different storage elements. In addition, soft error rates (SER) depend on voltage and circuit characteristics.
Combinatorial logic can be affected if a soft error occurs in the window that would cause the corrupted value to be captured in that logic's latches or flip-flops. For static logic, this window is very narrow and the logic is built with rather large transistors, which can better fight the spurious charge collected due to radiation events. For dynamic logic, the window is wider (equal to the evaluation stage of the logic) and the charge after precharge is preserved by a half-latch (“keeper” logic). Hence, this logic is significantly more sensitive to radiation events than static logic (although less sensitive than storage elements because of the refresh due to precharge).
Left uncorrected, soft errors induce an error rate which is higher than all other reliability mechanisms. For modern microprocessors, which have large SRAM elements (mostly large caches) and are implemented in deep sub-micron technologies, the error rate, which is dominated by single bit upsets, continues to grow with the increased number of bits in each technology generation. If the single bit upsets in SRAMs are left uncorrected, the reliability (MTBF) of these microprocessors becomes unacceptable. This is the reason why most modern microprocessors implement error detection and correction (EDC) mechanisms (at least) for their caches. These mechanisms are capable of detecting and correcting single bit upsets. It has been observed in chip multi-threading microprocessors that adding EDC to the caches reduces the failure rate due to soft errors (improves MTBF) by over two orders of magnitude. With the single bit upsets for large storage elements out of the way, the failure rate due to soft errors (FRSE) is dominated by the SER of smaller, unprotected storage structures, like register files, buffers and queues etc., as well as the SER of the flip-flops and latches in the microprocessor's logic.
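How a code can both detect and correct a single bit upset can be illustrated with a toy Hamming(7,4) code. This is only a sketch: the text does not say which code any particular cache uses, real caches typically protect much wider words, and the function names here are hypothetical.

```python
# Toy single-error-correcting code, Hamming(7,4); an illustration, not a cache design.

def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4            # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # the syndrome names the flipped position
    if position:
        c[position - 1] ^= 1          # flip the upset bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                      # simulate a single bit upset at position 5
    print(hamming74_correct(word))
```

The syndrome bits re-evaluate each parity group; read as a binary number they give the position of the flipped bit, which is why a single upset can be repaired rather than merely flagged.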
For modern microprocessors that correct single bit upsets in their caches, the hard error rate becomes another significant reliability component. The hard errors, which are the result of either process or manufacturing defects, or of processor wear-out (electromigration, thermal cycling etc.), are becoming more frequent as microprocessors are implemented in ever denser, deep sub-micron technologies. The main reasons for this are increased power densities in transistors and interconnect due to smaller device and interconnect geometries, higher transistor count, power management techniques that might result in thermal cycling, etc. As the hard errors reflect failures in the chip's transistors and interconnect, the hard error rate of a block is proportional to that block's area.
For correctable errors, the error detection mechanisms in a microprocessor usually differentiate between soft and hard errors based on the success of the correction mechanism to recover from the error. All detected errors are, normally, communicated to software by either interrupts (in the case of errors corrected by special hardware mechanisms, as described below) or by traps (in the case of errors corrected by instruction retry initiated by hardware or software). Typically, the software tallies the different errors and, if a certain error occurs more than a preset number of times, then that error is declared a hard error and treated accordingly. The software could also keep track of errors in different components for preventive maintenance, in order to identify and report the components with error rates above an acceptable threshold.
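The software tallying described above might be sketched as follows. The class name, the threshold value, and the error identifiers are all hypothetical; the text specifies only that an error recurring more than a preset number of times is declared hard and that per-component rates can be tracked for preventive maintenance.

```python
# Illustrative error-tally sketch; names and threshold values are assumptions.
from collections import Counter


class ErrorTally:
    def __init__(self, hard_threshold: int = 3):
        self.hard_threshold = hard_threshold   # preset count; the value is an assumption
        self.counts = Counter()                # keyed by (component, error_id)

    def record(self, component: str, error_id: str) -> str:
        """Count one corrected error and classify it as 'soft' or 'hard'."""
        key = (component, error_id)
        self.counts[key] += 1
        # The same error seen more than the preset number of times is declared hard.
        return "hard" if self.counts[key] > self.hard_threshold else "soft"

    def components_over(self, limit: int):
        """Components whose total error count exceeds a maintenance limit."""
        per_component = Counter()
        for (component, _), n in self.counts.items():
            per_component[component] += n
        return sorted(c for c, n in per_component.items() if n > limit)


if __name__ == "__main__":
    tally = ErrorTally(hard_threshold=2)
    for _ in range(3):
        print(tally.record("L2 cache", "line 0x40 bit 7"))
```

Keying the tally on both the component and the specific error lets the same structure serve both purposes: repeated identical errors suggest a hard fault, while the per-component totals support preventive maintenance reporting.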
A microprocessor's errors can be classified as a function of the existence of EDC mechanisms for that error. Errors can be classified into the following four main classes:
1. Detected and correctable errors: the error can be detected and the correct value can be recovered. This type of coverage can be achieved by error correction codes (ECC), by parity or residue detection of errors in write-through caches (the parity or residue error forces a miss and, as a result, a refresh of the cache line) or by error detection (parity, ECC, residue, etc.) in storage structures that do not hold architectural state, in logic gates or in flip-flops (if covered). The error correction for these soft errors is done by either hardware or software. In hardware, the error correction is done by either special state machines (e.g. correcting and writing back a dirty line with a single-bit error in a write-back cache before returning the corrected data to the pipeline) or by clearing the pipeline when an instruction with an error tries to commit and re-executing the instructions pending in the pipeline, beginning with the instruction affected by the error. In the case of software correction, the error usually causes a precise trap when the first instruction affected by the errors tries to commit. The trap's service routine can then correct the error using processor hardware that allows it to access the storage elements affected by the error.
2. Detected and uncorrectable errors (DUE errors): the error is detected, but cannot be corrected, resulting, in some systems, in an application or system crash. Parity errors, ECC-detected multi-bit errors in write-back caches, or residue errors in an architectural register are examples of such detected, but uncorrectable errors (at least not correctable by the detecting mechanism).
3. Undetected and unimportant errors: while an error occurred, it affected a structure which is part of speculation, so it does not impact correctness (e.g. a branch predictor). Actually those errors are detected and corrected as part of the normal processor functionality of checking the correctness of the speculation, so the error recovery is indistinguishable from recovering from a wrong speculation.
4. Undetected and uncorrectable errors: an error occurred, but was undetected and caused silent data corruption. These are also known as Silent Data Corruption (SDC) errors. SDC errors can affect the processor state for a significant amount of time without being detected. They are considered the most dangerous type of errors and should be eliminated as much as possible.
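The four classes above can be summarized as a small decision function. This is only a sketch; the enum, the function name, and the three boolean inputs are hypothetical simplifications of the classification.

```python
# Illustrative mapping of the four error classes; all names are assumptions.
from enum import Enum


class ErrorClass(Enum):
    DETECTED_CORRECTABLE = 1
    DETECTED_UNCORRECTABLE = 2     # DUE
    UNDETECTED_UNIMPORTANT = 3     # confined to speculative structures
    SILENT_DATA_CORRUPTION = 4     # SDC


def classify(detected: bool, correctable: bool, speculative_only: bool) -> ErrorClass:
    """Map the properties of an error onto one of the four classes."""
    if detected:
        if correctable:
            return ErrorClass.DETECTED_CORRECTABLE
        return ErrorClass.DETECTED_UNCORRECTABLE
    # Not seen by any dedicated checker: harmless only if it stays within speculation.
    if speculative_only:
        return ErrorClass.UNDETECTED_UNIMPORTANT
    return ErrorClass.SILENT_DATA_CORRUPTION


if __name__ == "__main__":
    print(classify(detected=False, correctable=False, speculative_only=False))
```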
Error detection is the most important reliability function in a microprocessor, as an undetected error could silently corrupt the system's state, with potentially grave consequences. Microprocessors designed for mission critical systems, for servers, etc., invest a large percentage of their area and power budgets in error detection and, when possible, correction to avoid faults from SDC errors. IBM's G4/G5 microprocessors have two identical copies of the pipeline (the I-unit and E-unit), sharing the first-level caches, which are parity protected. Pipeline errors are detected by comparing the results from the two I- and E-units. The arrays holding the processor state (register files, store buffer) are ECC protected. In case of an uncorrectable or hard error, the G5 processor signals the operating system to transfer the state of the failed processor to the dispatch queue of a different processor in the system. The failed processor is taken out of the active configuration and the task it was executing is restarted on the new processor with the proper priority. Duplicating the I- and E-units improves error detection, but at a high price in area and power: about 35% chip area overhead.
Fujitsu's 5th generation SPARC64 microprocessor achieves error detection by using a variety of error detection mechanisms like parity and ECC on the caches, register files, parity prediction and checking for ALUs and shifters, 2-bit residue checker for the multiply/divide unit, etc. Parity check also covers over 80% of the chip's latches, including all data path latches. Error recovery is done by re-issuing (retrying) at commit an instruction that cannot commit due to an error that affected its execution. All of these error detection mechanisms and their checkers sprinkled throughout the chip benefit reliability, but add significant complexity and area to the chip.
The DIVA and the SplitDiva checker architectures not only detect errors in the pipe, but also incorrect results due to design corner cases (those cases of a strange combination of happenings and circumstances that conspire to generate errors). These checker architectures achieve this by providing a checker in addition to the core processor. The checker, which can be simpler and slower than the core processor, executes the same instructions as the core, checks the correctness of these instructions and retries instructions that fail. The checker is also designed to take over the program execution in case of a hard core processor failure, but with poor performance. DIVA delivers error detection and correction (including surviving design corner cases, uncorrectable errors and hard errors), but at a significant cost in area and power.
Run-Ahead Execution (RAE) is a microarchitecture optimization that attempts to prefetch for loads further down the execution path when the processor has a lengthy stall (e.g., a load missing the L2 cache). Though RAE is a performance optimization technique primarily, it also improves the failure rate due to soft errors, because the residence time of data in unprotected registers and flops on the processor core is bounded by the initiation of RAE and consequent flushing on a lengthy stall.
Error detection techniques for existing high reliability microprocessors suffer from high area and power overhead, and might be overkill for most markets. After EDC is added to caches, the unprotected regular structures (e.g. register files) become some of the most important contributors to the failure rate due to soft errors, while the execution units, which occupy a large portion of each processor core's area, are some of the most important contributors to the hard error rate, and, to a lesser extent, soft error rate.
As discussed above, conventional processors mostly detect errors in random access memory, although there is also a need to detect errors arising from register files, execution units, buffers, etc. One of the most efficient (i.e., low overhead) ways for detecting errors in execution units is with residue checking. Residue checking has been implemented for arithmetic units (adders, multipliers, dividers). Some mainframes (e.g., Amdahl's 5990A and 5990M) use a modulo 3 (2-bit) residue checker for their multiply/divide units. More recently, microprocessors, such as Fujitsu's SPARC64 microprocessors, adopted the same technique (also for error detection in the multiply/divide unit). The motivation for employing residue-based error detection for their arithmetic units is 1) that the technology of these microprocessors makes transient errors in those units more probable, and 2) that, in time, hard errors could occur in these units and, if not detected, could result in silent data corruption.
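The principle behind a modulo 3 residue check can be sketched in a few lines. This is a software illustration for unsigned operands, not a description of any of the cited hardware; the function names and the injected-fault parameter are hypothetical.

```python
# Illustrative modulo 3 (2-bit) residue check; names are assumptions.
MODULUS = 3


def residue(x: int) -> int:
    """2-bit residue of a non-negative integer."""
    return x % MODULUS


def checked_multiply(a: int, b: int, multiplier=lambda x, y: x * y):
    """Run the (possibly faulty) multiplier and check its result against the residues."""
    product = multiplier(a, b)
    predicted = (residue(a) * residue(b)) % MODULUS   # computed from operand residues only
    return product, residue(product) == predicted


def checked_add(a: int, b: int, adder=lambda x, y: x + y):
    """Same check for an adder: residues add modulo 3."""
    total = adder(a, b)
    predicted = (residue(a) + residue(b)) % MODULUS
    return total, residue(total) == predicted


if __name__ == "__main__":
    # Hypothetical fault: bit 5 of the product is flipped inside the multiplier.
    faulty = lambda x, y: (x * y) ^ (1 << 5)
    print(checked_multiply(12345, 6789))
    print(checked_multiply(12345, 6789, faulty))
```

The check works because residues are preserved by addition and multiplication, so the residue of the result can be predicted from the much narrower operand residues. A single flipped bit changes the result by a power of two, which is never a multiple of 3, so every single-bit error in the result is caught; errors that change the result by a multiple of 3 escape this particular check.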
Though conventional techniques protect arithmetic units with residues, these techniques are piecemeal and require inefficiently crossing many protection domains variously protected by parity, ECC and residues. Accordingly, a technique is desired that maximizes error detection (minimum silent data corruption in case of an error) with minimum area overhead and minimum complexity.