Fault tolerance is an engineering principle that requires a system to continue operating despite failures or faults, albeit with a possibly diminished level of service or capacity. Fault tolerant design has been applied to computer systems to ensure system availability and crash resilience, generally through replication and redundancy. Replication requires multiple components to provide identical functionality while operating in parallel to improve the chances that at least one component is working properly. One way to effect fault tolerance through replication is by following a quorum rule based on a majority of votes received from the constituent replicated components, where the output agreed upon by the majority is used as the output of the whole system. Redundancy also requires multiple components to provide identical functionality, but fault tolerance is instead provided by switching, or “failing over,” from a faulty component to a spare redundant component. A fault tolerant system can combine both replication and redundancy at different levels of component design. For instance, storage servers conventionally use replication in data storage through a redundant array of inexpensive disks, or RAID, hard drive configuration, while also including redundant power supplies to serve as spares in case of the failure of the power supply currently in use.
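By way of illustration only, the majority quorum rule and the failover approach described above can be sketched as follows. The function names `majority_vote` and `fail_over` are hypothetical, and treating any raised exception as a component fault is an assumption made for brevity, not a description of any particular system.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the output reported by a strict majority of replicas.

    Raises RuntimeError when no single output has more than half the votes.
    """
    if not outputs:
        raise RuntimeError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def fail_over(components, request):
    """Try the active component first, then each spare in order.

    Assumption: a component signals a fault by raising an exception.
    """
    for component in components:
        try:
            return component(request)
        except Exception:
            continue  # switch to the next spare component
    raise RuntimeError("all components failed")


# Replication: three replicas run in parallel; one is faulty.
replicas = [lambda x: x * 2, lambda x: x * 2, lambda x: x * 2 + 1]
voted = majority_vote([replica(21) for replica in replicas])  # 42


# Redundancy: the active component fails, so the spare is used instead.
def failed_component(_request):
    raise IOError("component out of service")


result = fail_over([failed_component, lambda x: x + 1], 1)  # 2
```

In this sketch, voting masks a faulty replica without identifying it, whereas failover depends on the fault being detected before the spare is brought into service.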
In general, replication and redundancy provide fault tolerance at the physical component or hardware level, where failure conditions can be readily contained by removing a component from service until a permanent repair can be effected. Replicated or redundant fault tolerance can also be used in software. Providing software fault tolerance through replication or redundancy, though, can be expensive and inefficient in terms of resource utilization and physical component costs. Additionally, software fault tolerance adds the complication of needing to provide continued service, often without the certainty that the faulty software component has been both correctly identified and rendered harmless. A software error that potentially affects system state could remain undetected and persist latently, only to later rematerialize with possibly catastrophic consequences, despite earlier efforts to undertake fault tolerance.
Alternatively, fault recovery can be used for software to directly address the underlying causes of a fault or failure, rather than relying on the indirect quorum voting or failover solutions used in fault tolerance. In general, software fault recovery can be provided through either roll forward or roll back. Roll forward requires a software system to attempt to correct its system state upon detecting an error and to continue processing forward from that point in execution, relying on the self-corrections as being sufficiently remedial. Roll back requires a software system to revert to an earlier, and presumably safe, version of system state and to continue processing forward from that earlier version after backing out any erroneous system state.
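By way of illustration only, the difference between roll back and roll forward can be sketched with a small in-memory state dictionary. The helper names (`apply_with_rollback`, `apply_with_roll_forward`), the `is_valid` check, and the `repair` callback are all hypothetical, and the checkpoint here is simply a deep copy of the state taken before the operation runs.

```python
import copy


def apply_with_rollback(state, operation, is_valid):
    """Roll back: restore the checkpointed state if the operation leaves an error."""
    checkpoint = copy.deepcopy(state)  # earlier, presumably safe, version
    try:
        operation(state)
        if not is_valid(state):
            raise ValueError("erroneous system state detected")
    except Exception:
        # Back out the erroneous state and resume from the checkpoint.
        state.clear()
        state.update(checkpoint)
        return False
    return True


def apply_with_roll_forward(state, operation, is_valid, repair):
    """Roll forward: correct the state in place and continue from that point."""
    operation(state)
    if not is_valid(state):
        repair(state)  # assumed to be sufficiently remedial
    return state


def withdraw_150(state):
    state["balance"] -= 150


def non_negative(state):
    return state["balance"] >= 0


def clamp_to_zero(state):
    state["balance"] = 0


rolled_back = {"balance": 100}
committed = apply_with_rollback(rolled_back, withdraw_150, non_negative)
# committed is False and rolled_back is {"balance": 100} again

rolled_forward = {"balance": 100}
apply_with_roll_forward(rolled_forward, withdraw_150, non_negative, clamp_to_zero)
# rolled_forward is {"balance": 0}
```

Roll back discards the work done since the checkpoint, while roll forward keeps it and depends on the repair being adequate.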
However, even with replicated multithreaded software execution, resilience to latent software faults and failures, colloquially referred to as “bugs,” is not assured due to the nondeterministic nature of multithreaded programs. Current multicore processors execute multithreaded code nondeterministically because, given the same inputs, execution threads can potentially interleave their memory and I/O operations differently in each execution. The nondeterminism arises from small perturbations in the execution environment due to, for instance, other processes executing simultaneously, differences in operating system resource allocation, cache and translation lookaside buffer states, bus contention, and other factors relating to microarchitectural structures. Software behavior is subject to change with each execution due to multicore nondeterminism, and runtime faults, such as synchronization errors, can appear at inconsistent times that may defy subsequent efforts to reproduce and remedy them. Multithreaded programming is difficult and can lead to complicated bugs. Existing solutions focus on hardware fault tolerance and are ill-suited to resolving the kinds of multithreaded programming bugs that must be addressed to achieve software fault tolerance.
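By way of illustration only, the effect of differing interleavings can be sketched without real threads by enumerating every legal ordering of two simulated threads that each increment a shared counter through a separate read and write. The thread labels "A" and "B" and the helper names `run` and `interleavings` are hypothetical. The simulation is deterministic, so it shows which outcomes are possible rather than which one a given multicore execution would produce.

```python
from itertools import permutations

# Each simulated thread performs an unsynchronized increment: read, then write.
STEPS = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]


def run(schedule):
    """Execute the steps in the given order and return the final shared value."""
    shared = 0
    local = {}  # per-thread private copy of the value that was read
    for thread, step in schedule:
        if step == "read":
            local[thread] = shared
        else:
            shared = local[thread] + 1
    return shared


def interleavings():
    """Yield every ordering in which each thread reads before it writes."""
    for order in permutations(STEPS):
        a_ok = order.index(("A", "read")) < order.index(("A", "write"))
        b_ok = order.index(("B", "read")) < order.index(("B", "write"))
        if a_ok and b_ok:
            yield order


# Same program, same input, different interleavings.
outcomes = {run(schedule) for schedule in interleavings()}
print(sorted(outcomes))  # [1, 2]
```

Of the six legal orderings, only the two in which one thread finishes before the other starts yield 2. The other four lose an update and yield 1, which is the kind of synchronization error that can surface at inconsistent times across executions.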