To implement fault tolerance, some computing systems execute duplicate copies of a user program on multiple processors in a lock-step fashion. In a dual-modular redundant system, two processors are used, and in a tri-modular redundant system, three processors are used. Outputs of the duplicate copies of the user program are compared or voted on, and in the event the outputs match, they are consolidated and sent to other portions of the computing system. If the outputs do not match, the processor experiencing a computational or hardware fault is voted out and logically (though not necessarily physically) removed from the system.
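The compare-or-vote step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `vote`, the use of a dictionary keyed by processor identifier, and the strict-majority rule are hypothetical choices, not details taken from any particular system.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional, Tuple


def vote(outputs: Dict[str, Hashable]) -> Tuple[Optional[Hashable], List[str]]:
    """Return (consolidated_output, ids_of_processors_voted_out).

    `outputs` maps a hypothetical processor id to the output produced by
    that processor's copy of the user program.
    """
    if not outputs:
        raise ValueError("at least one processor output is required")

    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]

    # Without a strict majority (for example, two processors that disagree),
    # voting alone cannot say which processor faulted, so nothing is
    # consolidated and no processor is named.
    if n * 2 <= len(outputs):
        return None, []

    # Processors whose output differs from the majority are voted out.
    faulted = sorted(p for p, v in outputs.items() if v != value)
    return value, faulted
```

In this sketch, a tri-modular call such as `vote({"A": 7, "B": 7, "C": 9})` yields the consolidated value together with `["C"]` as the processor to remove, whereas a dual-modular mismatch yields no consolidated value, since a comparison of two outputs detects the disagreement but does not by itself identify the faulty side.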
For the logically removed processor to resume lock-stepped execution of its duplicate copy of the user program, the memory of the failed processor needs to be copied from one of the remaining processors. One mechanism to perform the memory copy is to stop execution of user programs on the processor or processors in the system that did not experience a fault, and copy the entire memory of one of those processors to the memory of the failed processor. However, the amount of memory to be copied may be in the gigabyte range or greater, and thus the amount of time the entire computer system is unavailable may be significant. A second method is to cyclically pause the user programs of the non-failed processors and, during each pause, copy a small portion of the memory from a non-failed processor to the memory of the failed processor. Eventually, all the memory locations will be copied, but inasmuch as the user programs run intermittently between copy operations, memory locations previously copied may change. Thus, such a system may need to track memory accesses of a user program that modify portions of the memory already copied to the memory of the failed processor. At some point, all the non-failed processors are stopped, and all the memory locations changed by user programs after having been copied are copied again to the memory of the failed processor. In practice, however, this last step of copying memory locations changed by the user programs may involve a significant number of memory locations, and thus the amount of time that the user programs are unavailable because of this copying may be excessive.
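The second method can be illustrated with a small simulation. The sketch below is a simplified model rather than an actual implementation: memory is represented as a Python list, and the names `incremental_copy`, `chunk`, and `writes_between_chunks` are hypothetical. It copies one chunk per pause, records writes that land in already-copied locations, and recopies those locations in a final pass during which the non-failed processors are assumed to be stopped.

```python
from typing import Dict, List, Set, Tuple


def incremental_copy(
    source: List[int],
    target: List[int],
    chunk: int,
    writes_between_chunks: Dict[int, List[Tuple[int, int]]],
) -> int:
    """Copy `source` into `target` chunk by chunk.

    `writes_between_chunks[i]` lists the (address, value) writes that the
    user programs perform after the i-th chunk has been copied.
    Returns the number of locations recopied in the final pass.
    """
    dirty: Set[int] = set()
    copied_upto = 0
    step = 0

    while copied_upto < len(source):
        # User programs are paused: copy one small portion of memory.
        end = min(copied_upto + chunk, len(source))
        target[copied_upto:end] = source[copied_upto:end]
        copied_upto = end

        # User programs resume and may write to any location.
        for addr, value in writes_between_chunks.get(step, []):
            source[addr] = value
            if addr < copied_upto:
                # The location was already copied, so the copy is now stale.
                dirty.add(addr)
        step += 1

    # Final pass: all non-failed processors are stopped and every
    # location changed after it was copied is copied again.
    for addr in dirty:
        target[addr] = source[addr]
    return len(dirty)
```

The returned count corresponds to the work done while the user programs are unavailable. Under a write-heavy workload the set of stale locations can approach the size of the memory itself, which is the drawback noted above.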