Continuously available computer systems, i.e., fault-tolerant systems, typically have redundant hardware that executes in clock lockstep; that is, the CPUs on both computer systems execute the same instructions in a given clock cycle. The failure of one of the computer systems does not typically bring the fault-tolerant system down, and applications generally continue to execute on the redundant computer system without interruption.
Uninterrupted operation is often critical in real-time redundant systems. Servers that run the New York Stock Exchange, computers that operate on the space shuttle, and chips that operate in some artificial hearts are examples of fault-tolerant systems. If a component does fail, a backup, generally an identically configured computer system or chip, exists to replace the failed component and pick up operations at the exact point of failure, both in terms of the functions being performed and the state of the system memory. One way to achieve this redundancy is to execute the components in lockstep. In a fault-tolerant system, the two (or more) computer systems are typically physically identical, e.g., both contain the same type of processor from the same manufacturer attached to identical motherboards. The computer systems share a common clock such that when an instruction is executed on one computer system, it is simultaneously executed on the other. Both write to the same memory address in their respective data stores, and both generally take the same amount of time to complete a task. In the event that one computer system fails, the other takes over and is relied upon by the user.
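By way of illustration only, the lockstep behavior described above can be sketched in software. The following Python sketch is a simplified model rather than an actual fault-tolerant implementation: the names System, execute, and lockstep_step are hypothetical, each call to lockstep_step stands in for one tick of the common clock, and an instruction is reduced to a single write of a value to a memory address.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Hypothetical model of one computer system in the redundant set."""
    memory: dict = field(default_factory=dict)  # this system's own data store
    failed: bool = False

    def execute(self, address, value):
        # A failed system can no longer execute instructions.
        if self.failed:
            raise RuntimeError("system has failed")
        self.memory[address] = value


def lockstep_step(systems, address, value):
    """Model one shared clock tick: every healthy system executes the same write.

    Returns the systems that completed the instruction.
    """
    survivors = []
    for system in systems:
        try:
            system.execute(address, value)
            survivors.append(system)
        except RuntimeError:
            # The failed system drops out; the others carry on.
            pass
    if not survivors:
        raise RuntimeError("all redundant systems have failed")
    return survivors


primary, backup = System(), System()
lockstep_step([primary, backup], 0x10, 1)   # both stores now hold the same state
backup.failed = True                        # simulate a failure of one system
remaining = lockstep_step([primary, backup], 0x14, 2)
```

In this sketch, both data stores hold identical contents after the first step, and after the simulated failure only the surviving system continues to execute, without any interruption to the caller.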
When a failure does occur, the failed computer system is usually replaced as soon as possible because the system as a whole is no longer redundant and therefore no longer fault-tolerant. To bring a replacement computer system into lockstep with the executing (online) computer system, the memory of the online computer system (the application and system state) generally needs to be copied to the replacement computer system. Traditional methods include halting all applications, copying the entire memory to the replacement computer system, and then resuming all processes in lockstep. However, halting the entire fault-tolerant system while the memory is copied may be inefficient and may not always be an option.
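The traditional halt, copy, and resume sequence described above can be sketched as follows. This Python sketch is illustrative only and rests on stated assumptions: Node and halt_copy_resume are hypothetical names, memory is modeled as a single byte array, and halting is modeled as clearing a flag rather than actually stopping processes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one computer system and its memory image."""
    memory: bytearray
    running: bool = True


def halt_copy_resume(online: Node, replacement: Node) -> None:
    # Step 1: halt all applications so the online memory stops changing.
    online.running = False
    # Step 2: copy the entire memory image to the replacement system.
    replacement.memory = bytearray(online.memory)
    # Step 3: resume both systems together, now starting from identical state.
    online.running = True
    replacement.running = True


online = Node(memory=bytearray(b"application and system state"))
replacement = Node(memory=bytearray(), running=False)
halt_copy_resume(online, replacement)
```

The cost noted above is visible in the sketch: between step 1 and step 3 nothing runs, and the length of that pause grows with the amount of memory to be copied.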