Certain computer systems need to provide service to customers even in the face of arbitrary combinations of hardware and software failures. Of course, some combinations of failures can disable even the most robust system, regardless of the reliability measures incorporated into it. Nevertheless, systems that must be highly available can benefit from a design and implementation that provides some robustness, for example, through failure detection or recovery.
Designing robust systems can be a difficult task, particularly when complex software (e.g., operating systems and customer applications) interacts with complex hardware. Software tools that assist with creating a robust program are generally cumbersome to use and require significant manual intervention by the designer. As a result, such tools often go unused, and the resulting software is not as robust as it otherwise could be.
Therefore, a need has long existed for fault monitoring, detection, and recovery functionality that overcomes the problems noted above and others previously experienced.