1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to a computer implemented method for improving failure tolerance in data processing systems. Still more particularly, the present invention relates to a computer implemented method, system, and computer usable program code for tolerating soft errors by selectively duplicating computation.
2. Description of the Related Art
An application may produce an incorrect output during execution. The incorrect output may be caused by an error in the code or by a soft error. A soft error is an error that occurs when, for example, a bit in a memory gets set or reset without an instruction of the application causing the bit to change.
A soft error is not a permanent failure because the same bit may not flip and the same error may not occur again during another execution of the application. Common causes of soft errors include noise on a communication line, power surges, and cosmic radiation.
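By way of illustration only, the transient nature of a soft error can be sketched in a few lines of Python. The function names `flip_bit` and `compute` and the values used are hypothetical and are not part of any described system. The sketch inverts a single bit of a stored value to simulate a soft error, and shows that the same computation produces the correct output when executed again on unaffected data.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the bit at position `bit` inverted.

    This simulates a soft error: the bit changes although no
    instruction of the application asked for it to change.
    """
    return value ^ (1 << bit)


def compute(operand: int) -> int:
    """Hypothetical application computation that consumes the stored value."""
    return operand + 1


stored = 42                      # 0b101010, the value the application wrote
corrupted = flip_bit(stored, 3)  # bit 3 flips: 0b101010 becomes 0b100010 (34)

faulty_output = compute(corrupted)  # incorrect output caused by the soft error
correct_output = compute(stored)    # a later execution without the bit flip
```

In this sketch `faulty_output` differs from `correct_output`, but nothing in the hardware or the code is permanently broken: the second execution, in which the bit does not flip, yields the expected result.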
A soft error can be corrected by rewriting the incorrect data with the correct data, such as by setting an affected bit to the correct state. Current technology provides methods for handling soft errors. One method of detecting and correcting soft errors is to use an error correction code. An error correction code, such as code implemented in an error correction tool, detects a soft error on the fly and performs computations to correct the error. Another method of handling soft errors is masking. Masking is a method in which an error is prevented from propagating, so that the output of the application remains unaffected by the error.
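As a non-limiting illustration of the two methods described above, the following Python sketch uses a Hamming(7,4) code as one well-known example of an error correction code, and a bitwise AND as one simple example of masking. The choice of the Hamming code, the function names, and the values are assumptions made for this sketch; the text above does not name any particular code or masking technique.

```python
def hamming74_encode(data: int) -> list[int]:
    """Encode a 4-bit value as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(data >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                            # index 0 unused; indices match positions
    code[3], code[5], code[6], code[7] = d    # data bits
    code[1] = code[3] ^ code[5] ^ code[7]     # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]     # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]     # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(codeword: list[int]) -> tuple[int, int]:
    """Correct at most one flipped bit and return (data, error_position).

    error_position is 0 when no error was detected, otherwise 1..7.
    """
    code = [0] + list(codeword)
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    syndrome = s1 + 2 * s2 + 4 * s4           # position of the flipped bit, if any
    if syndrome:
        code[syndrome] ^= 1                   # rewrite the affected bit to its correct state
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, syndrome


def low_nibble(value: int) -> int:
    """Hypothetical computation whose output depends only on bits 0..3."""
    return value & 0x0F


# Error correction: a single flipped bit is located and repaired.
word = hamming74_encode(0b1011)
word[4] ^= 1                                  # simulated soft error at position 5
recovered, position = hamming74_decode(word)  # recovered == 0b1011, position == 5

# Masking: a flip in bit 6 never reaches the output of low_nibble.
clean = 0b0010_1101
flipped = clean ^ (1 << 6)
masked_same = low_nibble(clean) == low_nibble(flipped)
```

In the first part of the sketch, the decoder detects the error on the fly and computes which bit to restore. In the second part, no correction takes place at all: the erroneous bit is simply discarded by the computation, so the output remains unaffected.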