In order to implement fault tolerance, some computing systems execute duplicate copies of a user program on multiple processor elements in a lock-step fashion. In a dual-modular redundant system, two processor elements are used, and in a tri-modular redundant system, three processor elements are used. Outputs of the duplicate copies of the user program are compared or voted, and in the event the outputs match, they are consolidated and sent to other portions of the computing system. If the outputs do not match, the processor element experiencing a computational or hardware fault is voted out and logically (though not necessarily physically) removed from the system.
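The compare-or-vote step described above can be illustrated with a minimal sketch. The function name `vote` and the dictionary-of-outputs representation are hypothetical, chosen only for illustration. The sketch also assumes that a strict majority is required to identify a faulty element; when two elements of a dual-modular redundant system disagree, a comparison alone cannot say which one is at fault, so this sketch reports no consolidated output in that case and leaves the fault isolation to some other mechanism.

```python
from collections import Counter


def vote(outputs):
    """Vote on the outputs of lock-stepped processor elements.

    outputs: dict mapping a processor element id to its (hashable) output.
    Returns (consolidated_output, voted_out_ids).
    Assumption: a strict majority is required; without one, the result is
    (None, []) and no element is singled out.
    """
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        # No strict majority (e.g. a two-element mismatch).
        return None, []
    # Elements that disagree with the majority are voted out.
    voted_out = sorted(pe for pe, out in outputs.items() if out != value)
    return value, voted_out
```

For example, under these assumptions `vote({"A": 7, "B": 7, "C": 9})` consolidates the output to `7` and reports `"C"` as voted out, while `vote({"A": 1, "B": 2})` yields no consolidated output.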
In order for the logically removed processor element to resume lock-stepped execution of the duplicate copy of the user program, the memory of the failed processor element needs to be copied from one of the remaining processor elements executing the user program. One mechanism to perform the memory copy is to stop execution of user programs on the processor element or processor elements in the system that did not experience a fault, and copy the entire memory of one of those processor elements to the memory of the failed processor element. However, the amount of memory to be copied may be in the gigabyte range or greater, and thus the amount of time the user program is unavailable may be significant. A second method to copy memory is to cyclically pause the user programs of the non-failed processor elements, and during each pause copy a small portion of the memory from a non-failed processor element to the memory of the failed processor element. Eventually, all the memory locations will be copied, but inasmuch as the user programs run between the copy operations, memory locations previously copied may change. Thus, such a system needs to track writes by a user program to portions of the memory that have already been copied to the memory of the failed processor element. At some point, all the non-failed processor elements are stopped, and the memory locations changed by user programs after those locations were copied are copied again to the memory of the failed processor element. In practice, however, this last step of copying memory locations changed by the user programs may involve a significant number of memory locations, and thus the amount of time that the user programs are unavailable may be excessive.
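The second method can be sketched as follows, as a simplified model rather than an actual implementation. Memory is modeled as a Python list, and `incremental_copy`, `chunk`, and the `run_user_program` callback are hypothetical names introduced for illustration. The callback is assumed to run the user program for one interval between pauses and to return the set of locations it wrote, which stands in for whatever write-tracking mechanism a real system would use.

```python
def incremental_copy(source, target, chunk, run_user_program):
    """Copy `source` into `target` in chunks, interleaved with execution.

    source, target: equal-length lists standing in for the memories of a
        non-failed and a failed processor element.
    chunk: number of locations copied during each pause.
    run_user_program: hypothetical callback; runs the user program for one
        interval against `source` and returns the set of indices it wrote.
    Returns the number of locations re-copied in the final pass.
    """
    dirty = set()      # already-copied locations that changed afterwards
    copied_upto = 0
    while copied_upto < len(source):
        # User program paused: copy one small portion.
        end = min(copied_upto + chunk, len(source))
        target[copied_upto:end] = source[copied_upto:end]
        copied_upto = end
        # User program resumes: track writes to the copied region only.
        written = run_user_program(source)
        dirty.update(i for i in written if i < copied_upto)
    # Final pass with all non-failed elements stopped: re-copy changes.
    for i in dirty:
        target[i] = source[i]
    return len(dirty)
```

The returned count is the size of the final stop-the-world pass. In this model it grows with the amount of writing the user program does to already-copied memory, which is the drawback the paragraph above identifies.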
The problems are further exacerbated in computing systems where the processor elements executing duplicate copies of the user program are distributed across a plurality of computer systems, and where those computer systems also have other processor elements executing other user programs. Depending on the architecture and the interconnections of the various computer systems, copying memory from a non-failed processor element to a failed processor element may affect operation of other logically grouped processor elements executing different user programs.