Typically, a High Availability Cluster is a group of loosely coupled nodes that work together to provide a reliable service to clients. High availability is achieved by continuously monitoring the state of each application and of all the resources on which the application depends. If an application terminates abnormally or the operating system suddenly fails, the application is automatically restarted on a backup server. This process of restarting the application on a backup server is herein referred to as “fall-over”.
As can be seen, the goal of a High Availability System such as HACMP™ (“High Availability Cluster Multi-Processing”), provided by International Business Machines (“IBM”) of Armonk, N.Y., is to reduce application downtime by continuously monitoring applications for any failure and automatically restoring them on a backup server after a failure. An application crash can be detected by monitoring resources associated with the application, such as its process ID (“PID”), log messages, and connection creation. There are generally two types of application failures that can lead to a complete failure of a service. The first is an application crash, in which the service terminates abnormally and unexpectedly. The second is an application hang or freeze, in which the service appears to be running but has stopped responding.
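By way of illustration only, PID-based crash detection of the kind mentioned above can be sketched as follows. This is a minimal sketch for POSIX systems and is not taken from HACMP; the function name `pid_alive` is hypothetical.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        # Signal 0 delivers nothing; the kernel only checks that the
        # target process exists and that we may signal it.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the application has terminated
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True
```

A monitor could call such a check periodically and initiate fall-over when it returns False. The check reports only that the process exists, not that it is doing useful work.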
Detecting a crashed application is relatively simple, whereas detecting a hung or unresponsive application can be more challenging. For example, when a server application is in a non-responsive state, the resources used by the application, such as its PID, memory, and CPU usage, usually appear normal, and the application is still able to accept new connections. Conventional methods for monitoring the availability of an application, which check only these resources, therefore generally cannot detect a non-responsive condition of a server application. As a result, high availability systems generally cannot detect a hung application effectively.
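The difference between a crashed and a hung server can be illustrated with a simple network probe. The following is a minimal sketch under stated assumptions, not a description of any existing product: the function name `probe`, the `ping` request, and the expectation that a healthy server sends back some reply are all hypothetical and would depend on the application protocol.

```python
import socket


def probe(host, port, timeout=1.0):
    """Classify a TCP service as 'down', 'hung', or 'responsive'."""
    try:
        # A successful connect only proves that a socket is listening.
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"  # connection refused or unreachable
    with conn:
        try:
            conn.sendall(b"ping\n")  # hypothetical application-level request
            reply = conn.recv(64)    # waits at most `timeout` seconds
        except OSError:              # includes socket.timeout
            return "hung"            # connected, but no answer arrived
    # An empty reply means the peer closed the connection without answering.
    return "responsive" if reply else "hung"
```

A listening socket whose owner never calls `accept()` typically still completes the TCP handshake in the kernel, so a connection-only check would report such a server as healthy, whereas a probe of this kind would be expected to return `"hung"` once the timeout expires.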
Therefore, a need exists to overcome the problems with the prior art as discussed above.