1. Technical Field
The present invention relates to microprocessors and, in particular, to mechanisms for handling errors in FRC-enabled processors.
2. Background Art
Servers and other high-end computer system are designed to provide high levels of reliability and availability. Soft errors pose a major challenge to both of these properties. Soft errors result from bit-flipping collisions between high-energy particles, e.g. alpha particles, and charge storing nodes. They are prevalent in storage arrays, such as caches, TLBs, and the like, which include large numbers of charge storing nodes. Rates of occurrence of soft errors (soft error rates or SERs) will only increase as device geometry decreases and device densities increase.
Highly reliable systems include safeguards to detect and manage soft errors, before they lead to silent, e.g. undetected, data corruption (SDC). However, to the extent error detection/handling mechanisms take a system away from its normal operations, the system's availability is reduced.
One well-known mechanism for detecting soft errors is functional redundancy checking (FRC). An FRC-enabled processor includes replicated instruction execution cores on which the same instruction code is run. Depending on the particular embodiment, each replicated execution core may include one or more caches, register files and supporting resources in addition to the basic execution units (integer, floating point, load store, etc.). FRC-hardware compares results generated by each core, and if a discrepancy is detected, the FRC system jumps to an error-handling routine. Since FRC errors indicate only that the execution cores disagree on a result, the error handling routine typically unwinds execution to the last know point of reliable data and resets the processor. This reset mechanism is relatively time consuming.
The point(s) at which results from different execution cores are compared represent the FRC-boundary for the system. Errors that are not detected at the FRC boundary can lead to SDC, but even managing detected FRC errors can create its own problems. As noted above, time spent by the system handling FRC-errors takes the system away from its normal operation, reducing system availability.
FRC is only one mechanism for handling soft errors, and for random logic, it is the primary mechanism. Array structures present a different picture. Array structures typically include parity and/or ECC hardware to detect soft errors. ECC hardware can even correct single bit errors on the fly. Further, unlike FRC errors, which only indicate that one of the execution cores generated an error, parity/ECC errors are clearly tied to one of the execution cores by the logic that detects them.
If a parity/ECC error in an execution core is detected before the corrupted data reaches the FRC boundary, the execution core that has the good, i.e. uncorrupted, data can be identified and used to recover from the error. If the corrupted data reaches the FRC boundary before the underlying parity/ECC error is detected, an FRC error will be generated, and the system will be unavailable while the FRC error is addressed. There is thus a race between the logic that processes the corrupted data towards the FRC boundary and the logic that detects its corrupted state. This race can have a significant impact on system availability, due to the longer latency of the reset mechanism triggered by an FRC-error (data mismatch) relative to the recovery mechanism triggered by an underlying parity/ECC error.
The present invention address these and other problems associated with error handling in FRC-enabled processors.