In a mission-critical, distributed computing environment, fault tolerance is essential for ensuring high system reliability and availability. “Fault tolerance”, as used herein, means that a client, whether a human end user or an element of a computer system or network, does not perceive any computer failure when the client requests execution of a task or series of tasks, even though the computer system or network may incur partial failure in the course of completing the client's request.
Many fault-tolerant systems are known in the data processing arts. Some are implemented mainly in hardware, such as hardware-assisted clustering, while others are implemented mainly in software, such as message-based systems. However, because it is relatively complex to implement, fault tolerance is usually available only in high-end computer systems, and the implementing software is typically bulky and expensive.
For the reasons stated above, and for other reasons stated below which will become apparent to those skilled in the art upon reading and understanding the present specification, there is a significant need in the art for fault-tolerant systems and methods that are light-weight, inexpensive, and readily scalable for use on computer systems of virtually any size or market segment.