A number of commercial software-based fault tolerance products are known in the art. These include Microsoft Cluster Server (MSCS), available from Microsoft Corporation of Redmond, Wash., USA, and Legato Automated Availability Manager, available from Legato Systems Inc., www.legato.com. Another known system is the Software-Implemented Fault Tolerance (SwiFT) system, available from Avaya Inc. of Basking Ridge, N.J., USA, and described in greater detail at http://www.research.avayalabs.com/project/swift. Such systems typically operate in a distributed computing environment that includes multiple computers or other computing machines. For example, a client-server environment is one type of distributed computing environment in which fault tolerance systems are utilized.
The above-noted conventional fault tolerance systems typically include a failure detection component and a failure recovery component. The failure detection component determines whether a monitored application, process or other program has terminated, aborted or otherwise failed. For example, in the above-noted SwiFT system, a monitoring process referred to as watchd serves as the failure detection component. The failure recovery component initiates recovery actions in the event that a failure is detected by the failure detection component. A given recovery action may involve restarting the program on the same machine or on another machine. As is well known, a program may be restarted from its initial starting point or via rollback to a designated checkpoint subsequent to its initial starting point.
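By way of illustration only, the division of labor between a failure detection component and a failure recovery component may be sketched as follows. This is a minimal sketch and not the implementation of watchd or of any product named above. The function name run_with_restart and the parameter max_restarts are hypothetical, and the sketch assumes that a nonzero exit status indicates failure and that recovery consists of restarting the program on the same machine from its initial starting point.

```python
import subprocess
import sys


def run_with_restart(cmd, max_restarts=3):
    """Run cmd and restart it from its initial starting point if it fails.

    Hypothetical helper for illustration. Returns a tuple of
    (final exit code, number of restarts performed).
    """
    restarts = 0
    while True:
        # Failure detection: wait for the program to end and inspect
        # its exit status (assumption: nonzero means failure).
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return exit_code, restarts
        if restarts >= max_restarts:
            # Restart budget exhausted; report the last failure.
            return exit_code, restarts
        # Failure recovery: restart the program on the same machine.
        restarts += 1


if __name__ == "__main__":
    # A child program that always fails with exit status 3.
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_with_restart(failing, max_restarts=2))
```

A detector of this kind observes only termination of the program. Rollback to a checkpoint would additionally require the program to save its state, which this sketch does not attempt.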
One common technique for failure detection involves monitoring messages and other signals from the operating system of a given machine to determine if a program on that machine has failed. Another technique involves periodic polling of the program to determine if the program is still “alive.” Other techniques focus on monitoring of the program environment or resource consumption. With techniques of this type, a failure may be indicated if a set of resources currently being consumed exceeds a threshold or if a set of available resources needed for proper operation of the program decreases below a threshold. Still other failure detection techniques involve modification of the program being monitored. An example of this type of technique is the insertion of a “heartbeating” mechanism in a program, with the mechanism being monitored by another program external to the monitored program.
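Two of the techniques just described, heartbeating and resource thresholds, may be sketched as follows. This is an illustrative sketch under stated assumptions, not the mechanism of any particular system. The names HeartbeatMonitor, beat, is_alive and resources_indicate_failure are hypothetical, as are the timeout and threshold parameters. For brevity the monitor is shown in the same process as the monitored code, whereas in practice the monitor would be a separate, external program.

```python
import time


class HeartbeatMonitor:
    """Tracks the most recent heartbeat from a monitored program.

    Hypothetical class for illustration. The clock is injectable so
    that the timeout logic can be exercised without real delays.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        # Called by the heartbeating mechanism inserted into the
        # monitored program.
        self._last_beat = self._clock()

    def is_alive(self):
        # The program is presumed failed if no heartbeat has arrived
        # within the timeout.
        return (self._clock() - self._last_beat) <= self.timeout_s


def resources_indicate_failure(consumed, consumed_limit,
                               available, available_floor):
    """Return True if consumption exceeds its threshold or if the
    available resources have dropped below their threshold."""
    return consumed > consumed_limit or available < available_floor
```

With a timeout of five seconds, for example, a program whose last heartbeat arrived six seconds ago would be reported as failed, while one that last reported four seconds ago would not.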
The conventional failure detection techniques identified above have a number of significant drawbacks. For example, they do not adequately detect certain types of failures, such as program hangs and performance degradation. Although certain fault tolerance software systems, such as the MSCS system, support the creation of custom libraries to augment failure detection, the application program interfaces (APIs) and processes required to create these libraries are often unduly complicated. Moreover, such custom libraries generally must be created separately for each specific application, process or other program that is to be monitored.
A need therefore exists for an improved fault tolerance software system that can detect a wider range of failures than conventional systems, while avoiding the complexities associated with creation of custom libraries.