Fault tolerance may be viewed as the ability to achieve desired results despite a failure in the system producing those results. A fault-tolerant computing system continues to operate properly in the event of failure of, or faults within, one or more of its components.
Some techniques for fault tolerance are based on separating the functionality of a standard server into compute activity and I/O activity. Compute activity is inherently synchronous: the transformations performed on data are deterministic in the number of instructions required to transform that data. I/O activity is inherently asynchronous, because it depends on factors such as disk latency, timer ticks, Ethernet packet arrivals, and video refresh rates. The correct operation of a compute environment can be verified by comparing the current states of two compute environments.
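The comparison of two compute environments can be illustrated with a minimal sketch. It is not taken from any particular system: the `ComputeEnvironment` class, the `run_in_lockstep` function, the `fault_at` parameter, and the arithmetic transformation are all hypothetical, and the sketch assumes that both environments receive identical inputs and that hashing a serialized copy of the state is an acceptable way to compare it.

```python
import hashlib


class ComputeEnvironment:
    """Hypothetical deterministic compute environment (illustration only)."""

    def __init__(self):
        self.state = {"accumulator": 0, "steps": 0}

    def apply(self, value):
        # Deterministic transformation: the same inputs always yield the same state.
        self.state["accumulator"] = (self.state["accumulator"] * 31 + value) % 1_000_003
        self.state["steps"] += 1

    def digest(self):
        # Hash a canonical serialization of the state so two environments
        # can be compared without exchanging the full state.
        canonical = repr(sorted(self.state.items())).encode()
        return hashlib.sha256(canonical).hexdigest()


def run_in_lockstep(inputs, fault_at=None):
    """Feed identical inputs to two environments and compare their state
    after every step. Returns the index of the first step at which the
    states diverge, or None if they never diverge.

    fault_at is a hypothetical knob that corrupts the second environment
    after the given step to simulate a fault.
    """
    env_a, env_b = ComputeEnvironment(), ComputeEnvironment()
    for index, value in enumerate(inputs):
        env_a.apply(value)
        env_b.apply(value)
        if index == fault_at:
            env_b.state["accumulator"] ^= 1  # simulated fault: flip one bit
        if env_a.digest() != env_b.digest():
            return index
    return None
```

With no injected fault, `run_in_lockstep([1, 2, 3])` returns `None` because both environments stay in identical states. With `fault_at=2`, the divergence is detected at step 2. The sketch covers compute activity only; asynchronous I/O would have to be delivered to both environments at the same point in their instruction streams for the comparison to remain meaningful.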
Another approach to fault tolerance is to employ a checkpoint/restart system in which a primary system periodically transfers its state to a secondary (backup) system at times that may be referred to as checkpoints. In the event of a failure in the primary system, control may be switched to the secondary system, which may restart operation beginning at the last checkpoint.
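The checkpoint/restart approach can be sketched in a few lines, under stated assumptions. The `run_with_failover` function, its parameters, and the dictionary-based state are hypothetical. The sketch also assumes that the inputs processed after the last checkpoint are still available to the secondary system, so that it can reprocess them after taking over.

```python
import copy


def run_with_failover(inputs, checkpoint_interval, fail_after=None):
    """Process inputs on a primary, checkpointing its state to a secondary.

    checkpoint_interval: number of processed inputs between checkpoints.
    fail_after: hypothetical knob; the primary fails once it has processed
        this many inputs (None means it never fails).

    Returns (final_state, restart_index), where restart_index is the input
    position at which the secondary resumed, or None if no failover occurred.
    """
    primary = {"total": 0, "processed": 0}
    checkpoint = copy.deepcopy(primary)  # state held by the secondary
    failed = False

    for value in inputs:
        if primary["processed"] == fail_after:
            failed = True  # simulated primary failure
            break
        primary["total"] += value
        primary["processed"] += 1
        if primary["processed"] % checkpoint_interval == 0:
            # Checkpoint: transfer a copy of the primary's state.
            checkpoint = copy.deepcopy(primary)

    if not failed:
        return primary, None

    # Failover: the secondary restarts from the last checkpoint and
    # reprocesses every input the checkpoint does not cover.
    secondary = copy.deepcopy(checkpoint)
    restart_index = secondary["processed"]
    for value in inputs[restart_index:]:
        secondary["total"] += value
        secondary["processed"] += 1
    return secondary, restart_index
```

For example, with inputs `[1, 2, 3, 4, 5, 6, 7]`, a checkpoint every 3 inputs, and a failure after 5 inputs, the secondary resumes at index 3 and reaches the same final total (28) as a run without failure. The work done on inputs 4 and 5 before the failure is lost and repeated, which shows the trade-off: more frequent checkpoints reduce the work lost on failover but increase the cost of state transfer.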