1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to a computer implemented method for improving failure tolerance in data processing systems. Still more particularly, the present invention relates to a computer implemented method, system, and computer usable program code for energy-efficient failure detection and masking.
2. Description of the Related Art
An application may produce an incorrect output during execution. The incorrect output may be caused by an error in the code or by a soft error. A soft error is an error that occurs when, for example, a bit in a memory gets set or reset without an instruction of the application causing the bit to change.
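By way of illustration only, the effect of a soft error on a stored value may be sketched as follows. The function name `flip_bit` and the example values are hypothetical and are not part of the described system; the sketch simply inverts one bit of an integer to model a bit that changes without any instruction of the application writing to it.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the bit at position `bit` inverted.

    XOR with a one-bit mask models a single-bit upset: the chosen bit is
    set if it was clear and cleared if it was set.
    """
    return value ^ (1 << bit)


# Hypothetical example: the application stored 0b01010010 (decimal 82).
stored = 0b01010010

# A soft error inverts bit 3 even though no instruction wrote to the value.
corrupted = flip_bit(stored, 3)

# Rewriting the affected bit with its correct state restores the value.
repaired = flip_bit(corrupted, 3)
```

In this simplified model, applying the same inversion a second time restores the original value, which corresponds to correcting the affected bit.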
Failure detection is the process of detecting errors, including soft errors, as they arise during execution of instructions. A soft error is not a permanent failure because the same bit may not flip, and the same error may not occur, during another execution of the application. Causes of soft errors include noise, power surges, and cosmic radiation.
A soft error can be corrected by overwriting the incorrect data with the correct data, such as by setting an affected bit to the correct state. Current technology provides methods for handling soft errors, such as redundantly computing the result of an instruction, detecting divergence among the redundant results, and possibly correcting the result via voting, and/or using error correction coding to detect when the contents of a register have changed. The present invention is concerned with errors that affect computations.
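The redundant computation and voting approach described above may be illustrated by the following minimal sketch. It is offered as an assumption-laden example of the general prior-art technique, not as the method of the present invention. The names `vote` and `run_redundantly`, the default of three copies, and the requirement that results be hashable are all choices made for this illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple


def vote(results: Sequence[Hashable]) -> Tuple[Hashable, bool]:
    """Return (majority_value, diverged) for a set of redundant results.

    `diverged` is True when at least one copy disagrees with the majority,
    which is how a failure is detected. A strict majority is required so
    that the failure can also be masked; otherwise an error is raised.
    """
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return winner, count != len(results)


def run_redundantly(compute: Callable[[], Hashable],
                    copies: int = 3) -> Tuple[Hashable, bool]:
    """Execute `compute` several times and vote on the outcomes."""
    return vote([compute() for _ in range(copies)])


# Hypothetical usage: one of three copies was corrupted by a soft error.
value, diverged = vote([7, 5, 7])
```

In this sketch, each redundant copy repeats the full computation, so the cost in time and energy grows in proportion to the number of copies.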