1. Field of the Invention
The present invention relates generally to computer systems. More particularly, the present invention relates to fault tolerant and highly available computer systems.
2. Description of the Background Art
Previous solutions for providing fault tolerance in digital processing are hardware based, software based, or some combination of both. Fault tolerance may be provided in hardware by running two full central processing units (CPUs) in lockstep, or three CPUs in a “voting” configuration. For example, a system may employ three CPUs executing the same instruction stream, along with three separate main memory units and separate I/O devices which duplicate functions, so that if one element of each type fails, the system continues to operate. Unfortunately, such systems incur tremendous system overhead, not only in terms of the number of CPUs required, but also in terms of the infrastructure supporting the CPUs (memory, power, cooling systems, and so on).
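By way of illustration only, the “voting” configuration described above may be modeled in software as a majority vote over the outputs of redundant replicas. The following is a minimal sketch and not a description of any particular prior system; the names majority_vote and run_voted are hypothetical, and each replica is assumed to be a callable returning a hashable result.

```python
from collections import Counter


def majority_vote(results):
    """Return the value agreed on by a strict majority of replicas.

    Raises RuntimeError when no value is produced by more than half
    of the replicas, since the fault cannot then be masked.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_voted(replicas, x):
    """Run every replica on the same input and vote on the outputs.

    `replicas` is a hypothetical stand-in for redundant processing
    units that all execute the same computation.
    """
    return majority_vote([replica(x) for replica in replicas])
```

With three replicas, a single faulty output is outvoted by the two agreeing outputs, which is why this arrangement tolerates one failure at the cost of tripling the processing resources.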
Software based solutions typically rely on complete re-running of a program at least three times. This results in effective execution times that are at least three times as long as if the program were run only once. Combination schemes require both extra hardware (for example, twice the hardware) and extra processing. The extra processing may take the form of software check-pointing. Software check-pointing refers to the ability to “replay” a specific instruction sequence when an error occurs.
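By way of illustration only, software check-pointing of the kind described above may be sketched as saving a copy of the program state before each step and, upon an error, restoring that copy and replaying the step. The function name run_with_checkpoints, the max_retries parameter, and the use of RuntimeError to signal a detected error are assumptions made for this sketch and are not drawn from any particular prior system.

```python
import copy


def run_with_checkpoints(state, steps, max_retries=3):
    """Apply each step to `state` (a dict), replaying a step on error.

    Before each step the state is check-pointed. If the step raises
    RuntimeError (assumed here to signal a detected fault), the state
    is rolled back to the checkpoint and the step is run again.
    """
    for step in steps:
        checkpoint = copy.deepcopy(state)
        for _attempt in range(max_retries + 1):
            try:
                step(state)
                break  # step completed without a detected error
            except RuntimeError:
                # Roll back any partial changes made by the failed step.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
        else:
            # Every attempt failed; the fault is not transient.
            raise RuntimeError("step failed after replay attempts")
    return state
```

The extra processing referred to above appears here as the copy made before every step and the repeated execution after a rollback.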
The above-discussed prior solutions are expensive in terms of hardware cost and/or system performance. Hence, improvements in systems and methods for providing fault tolerant digital processing are highly desirable.