1. Field
The present disclosure pertains to the field of data processing, and more particularly, to the field of error mitigation in data processing apparatuses.
2. Description of Related Art
As improvements in integrated circuit manufacturing technologies continue to provide for smaller dimensions and lower operating voltages in microprocessors and other data processing apparatuses, makers and users of these devices are becoming increasingly concerned with the phenomenon of soft errors. Soft errors arise when alpha particles and high-energy neutrons strike integrated circuits and alter the charges stored on the circuit nodes. If the charge alteration is sufficiently large, the voltage on a node may be changed from a level that represents one logic state to a level that represents a different logic state, in which case the information stored on that node becomes corrupted. Generally, soft error rates increase as circuit dimensions decrease, because the likelihood that a striking particle will hit a voltage node increases when circuit density increases. Likewise, as operating voltages decrease, the difference between the voltage levels that represent different logic states decreases, so less energy is needed to alter the logic states on circuit nodes and more soft errors arise.
Blocking the particles that cause soft errors is extremely difficult, so data processing apparatuses often include techniques for detecting, and sometimes correcting, soft errors. These error mitigation techniques include dual-modular redundancy (“DMR”) and triple-modular redundancy (“TMR”). With DMR, two identical processors or processor cores execute the same program in lockstep, and their results are compared. With TMR, three identical processors are run in lockstep.
An error in any one processor is detectable using DMR or TMR, because the error will cause the results to differ. TMR provides an advantage in that recovery from the error may be accomplished by assuming that a matching result of two of the three processors is the correct result.
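The comparison and voting described above can be illustrated with a minimal software sketch. It is only an analogy for what lockstep hardware would do, and the function names dmr_check and tmr_vote are hypothetical, introduced here for illustration. The sketch assumes each processor's result can be represented as a hashable value such as an integer.

```python
from collections import Counter

def dmr_check(result_a, result_b):
    """DMR: return True when the two redundant results agree.

    A mismatch reveals that an error occurred, but not which copy is wrong.
    """
    return result_a == result_b

def tmr_vote(result_a, result_b, result_c):
    """TMR: return (value, error_detected) using a two-of-three majority.

    The value shared by at least two results is assumed to be correct.
    """
    counts = Counter([result_a, result_b, result_c])
    value, votes = counts.most_common(1)[0]
    if votes < 2:
        # All three results differ, so no majority exists to recover from.
        raise RuntimeError("no two results match; cannot recover")
    return value, votes < 3
```

With DMR, a mismatch can only be reported. With TMR, a single faulty result is both reported and outvoted, which is the recovery advantage noted above.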
Recovery in a DMR system is also possible by checking all results before they are committed to a register or otherwise allowed to affect the architectural state of the system. Then, recovery may be accomplished by re-executing all instructions since the last checkpoint if an error is detected. However, this approach may not be practical due to latency or other design constraints. Another approach is to add a rollback mechanism that would permit an old architectural state to be recovered if an error is detected. This approach may also be impractical due to design complexity, and may suffer from the problem that the results of re-execution from a previous state may differ from the original results due to the occurrence of a non-deterministic event, such as an asynchronous interrupt, or the re-execution of an output operation that is not idempotent.
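The first recovery approach, checking results before they are committed and re-executing on a mismatch, might be sketched as follows. This is a simplified software model and not a description of any particular hardware design. The names run_checked, good_core, make_flaky_core, and increment are hypothetical, the state is modeled as a dictionary, and the checkpoint is assumed to be the last committed state. Every instruction is also assumed to be deterministic and free of side effects, which is exactly the assumption that asynchronous interrupts and non-idempotent output operations would violate.

```python
def run_checked(instructions, state, core_a, core_b, max_retries=3):
    """Run each instruction on two redundant cores; commit only on agreement."""
    committed = dict(state)  # last known-good architectural state (checkpoint)
    for instr in instructions:
        for _ in range(max_retries + 1):
            # Each core works on its own copy, so nothing is committed early.
            out_a = core_a(instr, dict(committed))
            out_b = core_b(instr, dict(committed))
            if out_a == out_b:
                committed = out_a  # results match: safe to commit
                break
            # Mismatch: discard both results and re-execute from the checkpoint.
        else:
            raise RuntimeError("results kept diverging; giving up")
    return committed

def good_core(instr, state):
    """A core that executes the instruction without error."""
    return instr(state)

def make_flaky_core(fail_on_call):
    """Build a core that simulates one soft error on its Nth call."""
    calls = {"n": 0}
    def core(instr, state):
        calls["n"] += 1
        out = instr(state)
        if calls["n"] == fail_on_call:
            out["x"] ^= 1  # simulated bit flip in the result
        return out
    return core

def increment(state):
    """Example instruction: add one to register x."""
    state["x"] += 1
    return state
```

For example, run_checked([increment, increment], {"x": 0}, good_core, make_flaky_core(1)) detects the simulated bit flip on the first attempt, re-executes that instruction, and commits the correct final state. The extra attempt illustrates the latency cost mentioned above.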
Additionally, DMR and TMR may actually increase the error rate because their implementation requires additional circuitry that is itself subject to soft errors, and because they may detect errors that would otherwise go undetected without resulting in system failure. For example, an error in a structure used to predict which branch of a program should be speculatively executed may result in an incorrect prediction, but the processor would automatically recover when the branch condition was ultimately evaluated.