The present invention relates to error recovery in a computing system and, more particularly, to transparent recovery from a hardware fault detected within a computing system.
A fault which occurs during execution of machine instructions in a computing system often renders data or subsequent execution of machine instructions invalid. Rather than halt operation entirely and restart the computing system, it is desirable to recover from the fault and continue processing with a minimum amount of disruption while ensuring that data and subsequent execution of machine instructions will be valid.
Software recovery techniques are known. In one such technique, software periodically records enough data to completely restore the system to a checkpoint where the system state is known to be valid for all operating purposes. When a fault is detected, file modifications performed since the last checkpoint must be undone, the computing system is reset to the last checkpoint, and the system is restarted from that point.
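By way of illustration only, the checkpoint technique described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any particular system: the names `FaultDetected`, `CheckpointedSystem`, `run`, and `max_retries` are hypothetical, the system state is modeled as an in-memory dictionary, and fault detection is modeled as a raised exception.

```python
import copy


class FaultDetected(Exception):
    """Hypothetical signal that a fault was detected while a step ran."""


class CheckpointedSystem:
    """Holds a working state plus a copy of the last known-valid state."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Record enough data to restore the state completely.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Undo every modification made since the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run(system, steps, max_retries=3):
    """Run each step, restarting it from the last checkpoint on a fault."""
    for step in steps:
        for _ in range(max_retries):
            try:
                step(system.state)
            except FaultDetected:
                system.rollback()
            else:
                system.checkpoint()
                break
        else:
            # Assumption: a fault that persists across retries is not recoverable.
            raise RuntimeError("step kept faulting after rollback")
```

In this sketch a checkpoint is taken after every successful step; a real program would choose checkpoint locations and contents itself, which is the burden the following paragraphs discuss.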
However, such a technique is not transparent to the user because the user is required to insert programming code at a proper place in the program to record enough information to restore the system to a valid state. Since the scheme requires the user to select which information to record at each checkpoint and at which time, it is prone to human error. If the checkpoint code is misplaced, needed data may be overwritten or otherwise lost before proper recording.
Another disadvantage of this technique is the requirement of almost constant interaction between the program and the operating system, which seriously degrades operating system efficiency. Furthermore, once the fault is detected, the process must be reversed until the last checkpoint is reached. This seriously degrades recovery time, particularly in large systems where large file structures must be modified.
Another recovery scheme uses modular redundancy. In this scheme, two or more processors execute identical code in parallel. At periodic checkpoints, the results from the processors are compared. If the results are found to differ, an arbitration scheme chooses among them. The duplication of hardware is almost always cost-prohibitive, and the extra hardware creates the possibility of a greater number of fault occurrences.
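By way of illustration only, the comparison step of a two-processor redundancy scheme can be sketched as follows. This is a simplified sketch under stated assumptions: the two processors are modeled as ordinary functions called in sequence rather than in parallel, every input is treated as a checkpoint, and the names `run_redundant`, `replica_a`, `replica_b`, and `arbitrate` are hypothetical. The arbitration scheme itself is left to the caller, since no particular scheme is specified here.

```python
def run_redundant(replica_a, replica_b, arbitrate, inputs):
    """Run two replicas on each input and compare their results."""
    results = []
    for value in inputs:
        a = replica_a(value)
        b = replica_b(value)
        if a == b:
            # The replicas agree, so the result is accepted as valid.
            results.append(a)
        else:
            # The replicas disagree; the caller-supplied arbitration
            # scheme (an assumption) chooses the result to keep.
            results.append(arbitrate(value, a, b))
    return results
```

For example, an arbitration function might re-execute the computation on a processor known to be good and return that result.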