Computer systems often perform critical control, analysis, communication, and other functions in hostile environments. When these systems are physically difficult or impossible to reach, it is important to provide adequate redundancy so that malfunctions and spurious errors can be detected and recovered from automatically. One common way of protecting against computer system errors is to employ dual-modular redundancy or triple-modular redundancy, that is, to operate two or three (or more) system modules in lockstep and compare their behavior. If several identical modules perform the same operation, then, in theory, any difference in the modules' behavior would indicate that one or more of the modules has malfunctioned. Differences could be detected, again theoretically, simply by comparing signals present at certain key places in the systems (for example, at the address and data buses) and starting error recovery procedures whenever a signal mismatch is detected.
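The comparison described above can be sketched in software. The following is a minimal illustration only, assuming that each module's bus value has already been sampled into an integer; the function name `compare_modules` and its return convention are hypothetical and do not describe any particular hardware comparator.

```python
from collections import Counter


def compare_modules(samples):
    """Compare one sampled bus value per module.

    Returns (agreed_value, suspects). agreed_value is the strict-majority
    value, or None when no strict majority exists (always the case when two
    modules disagree). suspects lists the indices of the modules that differ
    from the majority, or every module when there is no majority.
    """
    counts = Counter(samples)
    value, votes = counts.most_common(1)[0]
    if votes == len(samples):
        # All modules agree: no mismatch, nothing to recover from.
        return value, []
    if votes * 2 > len(samples):
        # A strict majority agrees: the dissenting modules are suspect.
        return value, [i for i, s in enumerate(samples) if s != value]
    # No majority: a mismatch is detected but cannot be attributed.
    return None, list(range(len(samples)))
```

With three modules, a single disagreeing module can be identified (for example, `compare_modules([0x1F, 0x2A, 0x1F])` flags module 1), whereas with two modules a mismatch can be detected but not attributed to either module.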
In practice, clock skew and similar effects can cause signal mismatches even when the modules are operating properly. Since error recovery can be a computationally expensive process, such spurious lockstep-failure signals can seriously degrade system performance. Also, error recovery may involve different operations on each of the modules, and there may be no effective redundant system to protect against errors that occur while recovery is in progress. Furthermore, traditional lockstep redundant systems contain specialized hardware circuits to perform signal comparison. These circuits may reduce the system's flexibility to operate as an ordinary multiprocessor system when redundant processing is not required.
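The effect of clock skew on a naive comparison can be illustrated with a toy model. This sketch assumes a hypothetical bus whose value changes once every `period` time units and a second, fault-free module that runs `skew` time units behind the first; the names `bus_value` and `spurious_mismatches` and all numeric values are illustrative assumptions.

```python
def bus_value(t, period=10):
    """Hypothetical bus signal whose value changes every `period` time units."""
    return t // period


def spurious_mismatches(sample_times, skew, period=10):
    """Return the sample instants at which two healthy modules appear to differ.

    Module B lags module A by `skew` time units, so at time t it drives the
    value that module A drove at time t - skew.
    """
    return [
        t
        for t in sample_times
        if bus_value(t, period) != bus_value(t - skew, period)
    ]
```

In this model, both modules are fault-free, yet any comparison sampled within `skew` time units after a transition reports a mismatch; with zero skew, no mismatch is reported.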