Some computer systems require high reliability to ensure data integrity and continuous computation, even during a fault or failure. Computer systems involved in banking, telecommunications, stock markets, and other mission-critical activities must be reliable. To achieve this reliability, such systems use multiple processors to provide fault-tolerant computing.
Fault-tolerant computing systems can tolerate the failure of a component and continue to operate. Some fault-tolerant computing systems use redundant circuit paths so that a failure of one path does not halt operation of the system. Other systems use self-checking circuitry having identical processing units. Each processing unit receives the same inputs and is expected to produce the same outputs. These outputs are compared, and if an inconsistency occurs, both processing units are halted to prevent the spread of possibly corrupt data. In some instances, two or more processing units operate in a lockstep mode in which each processor performs the same task at the same time.
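The compare-and-halt behavior described above can be illustrated with a minimal software sketch. It is not taken from any particular system: the class `SelfCheckedPair`, the exception `LockstepMismatch`, and the modeling of each processing unit as a plain function are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


class LockstepMismatch(Exception):
    """Raised when the paired units disagree, or when a halted pair is used."""


@dataclass
class SelfCheckedPair:
    # Hypothetical model: each processing unit is a function from input to output.
    unit_a: Callable[[int], int]
    unit_b: Callable[[int], int]
    halted: bool = False

    def step(self, value: int) -> int:
        """Feed the same input to both units and compare their outputs."""
        if self.halted:
            raise LockstepMismatch("pair is halted")
        out_a = self.unit_a(value)
        out_b = self.unit_b(value)
        if out_a != out_b:
            # Halt both units so that possibly corrupt data cannot spread.
            self.halted = True
            raise LockstepMismatch(f"input {value}: {out_a} != {out_b}")
        return out_a


def run(pair: SelfCheckedPair, inputs: Iterable[int]) -> List[int]:
    """Process inputs in order; stops at the first mismatch by raising."""
    return [pair.step(v) for v in inputs]
```

In this sketch, a pair built from two identical functions processes every input, while a pair in which one function returns a different value for some input halts at that input and rejects all later ones.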
Even fault-tolerant computing systems encounter failures and shutdowns. In some self-checking systems, for instance, soft errors (for example, a cache error seen by one of the paired processors but not the other) require both processors to be halted and restarted. Other errors also cause failures. For example, processor designs using translation lookaside buffers with entry checking, parity checking, bus protocol checking, and the like can have one processor detecting an error while the other processor does not.
Failures in fault-tolerant or self-checked processing systems occur for other reasons as well. Processors operate at core voltages that are set by the manufacturer. In multiple-processor systems, however, these processors exhibit minor nondeterministic behavior at the core voltages specified by the manufacturer. Even though this nondeterministic behavior does not affect normal processor operations, a pair of self-checked processors operating in lockstep will exhibit system failures. These failures are due to variations in how the processors perform requests and responses and in the order in which these appear on the system interfaces.
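A small sketch may help show why harmless ordering variations still trip a lockstep comparison. The traces below are invented for illustration and do not come from any real processor; the function names `strict_lockstep_equal` and `same_work_done` are likewise hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# A transaction is modeled as (kind, address); this model is an assumption.
Transaction = Tuple[str, int]


def strict_lockstep_equal(trace_a: List[Transaction], trace_b: List[Transaction]) -> bool:
    """Lockstep-style check: same transactions in exactly the same order."""
    return trace_a == trace_b


def same_work_done(trace_a: List[Transaction], trace_b: List[Transaction]) -> bool:
    """Order-insensitive check: same transactions, ignoring their order."""
    return Counter(trace_a) == Counter(trace_b)


# Invented traces: both processors issue the same three transactions,
# but the second processor emits the last two in the opposite order.
trace_a = [("read", 0x100), ("write", 0x200), ("read", 0x300)]
trace_b = [("read", 0x100), ("read", 0x300), ("write", 0x200)]

lockstep_ok = strict_lockstep_equal(trace_a, trace_b)
work_ok = same_work_done(trace_a, trace_b)
```

Under these assumptions, the order-insensitive check passes while the strict check fails, which corresponds to a self-checked pair reporting a failure even though each processor, taken alone, behaved normally.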