It is often desirable, in both distributed and non-distributed computing systems, to provide a mechanism for making the system tolerant to faults such as process failures and machine failures. The most important aspects of such fault tolerance mechanisms are generally error detection and error recovery. Conventional computing systems have been implemented that use process replication in conjunction with voting to perform error detection, and checkpointing to perform error recovery. Process replication generally involves running multiple copies of a given target program on different machines, and is also referred to as "N-version" or "N-modular" programming. Each copy of the program returns data values at specified breakpoints, and the voting process compares these values to determine whether any of the processes or machines have failed. For example, if the data returned by all but one of the copies are the same, it can be assumed that the minority copy has experienced a failure. In the event of such a failure, a checkpoint is taken using one of the copies that is executing properly, and the failed process is restarted from the checkpoint. Details regarding these and other conventional techniques are described in, for example, J. Long, W. K. Fuchs, and J. A. Abraham, "Forward recovery using checkpointing in parallel systems," Proc. IEEE International Conference on Parallel Processing, pp. 272-275, 1990; D. K. Pradhan and N. H. Vaidya, "Roll-forward and rollback recovery: Performance-reliability trade-off," Proc. 24th Fault-Tolerant Computing Symposium, pp. 186-195, 1994; D. K. Pradhan and N. H. Vaidya, "Roll-forward checkpointing scheme: A novel fault-tolerant architecture," IEEE Transactions on Computers, 43(10):1163-1174, October 1994; and Algirdas A. Avizienis, "The Methodology of N-Version Programming," in Michael R. Lyu, editor, Software Fault Tolerance, pp. 23-46, John Wiley & Sons Ltd., 1995.
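The replication, voting, and checkpoint-restart cycle described above can be illustrated with a minimal sketch. This is not taken from any of the cited works; the function names `vote` and `recover`, the dictionary-based replica state, and the strict-majority rule are all assumptions made for illustration, and the returned values are assumed to be hashable so that they can be counted.

```python
from collections import Counter


def vote(outputs):
    """Compare the values returned by replicas at one breakpoint.

    `outputs` maps a replica id to the value that replica returned
    (hypothetical layout; values must be hashable).
    Returns (majority_value, sorted list of replica ids that disagree).
    Raises ValueError if no value is held by a strict majority.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    failed = sorted(rid for rid, v in outputs.items() if v != value)
    return value, failed


def recover(states, outputs):
    """Restart each minority replica from a checkpoint of a healthy one.

    `states` maps a replica id to that replica's state, modeled here as a
    flat dict (an assumption; real checkpoints capture process state).
    Returns the list of replica ids that were restarted.
    """
    value, failed = vote(outputs)
    healthy = next(rid for rid, v in outputs.items() if v == value)
    checkpoint = dict(states[healthy])  # snapshot of a properly executing copy
    for rid in failed:
        states[rid] = dict(checkpoint)  # each restarted copy gets its own state
    return failed
```

For example, `vote({"a": 42, "b": 42, "c": 7})` returns `(42, ["c"])`, identifying replica `c` as the minority copy; `recover` would then overwrite the state of `c` with a copy of the state of `a`.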
Conventional fault tolerance techniques generally require the modification of either source code or binary executable code to add the above-noted error detection and recovery functionality. These modifications are typically performed prior to execution of the target program and often require the user to edit files or to run direct instrumentation software, which can be inefficient. Moreover, conventional techniques that rely on an operating system to detect errors cannot preserve data integrity when a fault does not trigger an operating system exception. Other conventional schemes use algorithm-based detection methods that are not applicable to many types of programs.