Resource/application monitoring is a key feature of High Availability systems (“HA systems”). An HA system is judged largely by how little manual intervention it requires to keep resources and applications highly available. In this context, the term “resource” generally refers to any managed entity, such as a software application, a network component, or a storage component.
When a resource goes down, the HA system should automatically restart the resource quickly, without the administrator/user having to do anything. At the same time, when there is an inherent problem with the startup of a resource, the HA system should not try to start the resource forever in a loop (start, fail, start, ...). When a resource is stuck in a loop of starting, failing, and restarting, the resource is said to be “bouncing”. The longer a resource bounces, the more system resources are wasted.
To prevent continuous bouncing of a faulty resource, most HA systems limit the number of times a resource can be restarted. Specifically, after the resource has been restarted a certain number of times (“MAX_RESTARTS”), the resource is simply stopped. Thus, the MAX_RESTARTS value serves as a cap on the number of times a faulty resource will bounce.
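As a rough illustration of the cap described above, the following Python sketch restarts a failing resource until an assumed MAX_RESTARTS value is reached and then stops it. The `start` callback, the `supervise` function, and the value 3 are hypothetical names and settings chosen for illustration; they are not taken from any particular HA system.

```python
from typing import Callable, Tuple

MAX_RESTARTS = 3  # assumed value, for illustration only


def supervise(start: Callable[[], bool],
              max_restarts: int = MAX_RESTARTS) -> Tuple[str, int]:
    """Start a resource, restarting it on failure up to max_restarts times.

    `start` is a hypothetical callback that returns True if the resource
    came up. Returns the final state and the number of restarts used.
    """
    restarts = 0
    while True:
        if start():
            return "running", restarts
        if restarts >= max_restarts:
            # The cap is reached: stop instead of bouncing forever.
            return "stopped", restarts
        restarts += 1


def always_fails() -> bool:
    """Stand-in for a resource with an inherent startup problem."""
    return False


state, restarts = supervise(always_fails)
print(state, restarts)  # stopped 3
```

With a resource that can never start, the loop makes one initial attempt plus three restarts and then gives up, so the amount of bouncing is bounded by the cap.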
Unfortunately, when MAX_RESTARTS is reached for a resource, there may not be an inherent problem with starting the resource. The restarts that caused MAX_RESTARTS to be reached may have occurred in the distant past, or may have occurred sporadically over a long period of time. Consequently, the fact that MAX_RESTARTS was reached for a resource may say nothing about the current stability of the resource. Thus, in many cases, even though MAX_RESTARTS has been reached, the resource might function well if it were simply restarted. However, because MAX_RESTARTS has been reached, the administrator is forced to start the resource manually.
Consider, for example, a system that uses a RESTART_COUNTER to keep track of how many times a resource has been automatically restarted. With each automatic restart of the resource, the RESTART_COUNTER is incremented. If the resource fails once in a while over a long period of time, the RESTART_COUNTER for the resource may eventually reach MAX_RESTARTS. After the last restart, the resource may be stable for a long period of time. Even after that long period of stability, the resource would not be automatically restarted if it failed again, because the RESTART_COUNTER has reached MAX_RESTARTS. Thus, the resource cannot be restarted automatically and requires user intervention to start.
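The scenario above can be sketched in a few lines of Python. The `Resource` class, its `on_failure` method, the MAX_RESTARTS value of 3, and the failure days are all made up for illustration; the sketch only models a counter that is incremented on every automatic restart and never reset.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical resource tracked with a lifetime restart counter."""
    max_restarts: int
    restart_counter: int = 0
    running: bool = True

    def on_failure(self) -> bool:
        """Handle one failure; return True if restarted automatically."""
        if self.restart_counter >= self.max_restarts:
            # Cap reached: the resource stays down until an administrator
            # starts it manually.
            self.running = False
            return False
        self.restart_counter += 1
        self.running = True
        return True


# Illustrative failure times, in days since deployment (made-up data).
failure_days = [10, 95, 200, 900]

res = Resource(max_restarts=3)
outcomes = [(day, res.on_failure()) for day in failure_days]

for day, restarted in outcomes:
    print(f"day {day}: {'auto-restarted' if restarted else 'manual start required'}")
```

In this run the first three sporadic failures use up the counter, so the failure on day 900 is not handled automatically even though the resource had been stable for 700 days.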
To reduce the frequency of administrator intervention, MAX_RESTARTS may be set to a large value. However, if MAX_RESTARTS is large, then the resource will “bounce” for a longer time when a failure is unrecoverable.
HA systems can be configured to implement various approaches to handling resource failures. In a first example approach, when a resource fails and no restarts remain, the resource is halted. If the resource is relocatable to another node, relocation is attempted. If the resource is not relocatable, it is simply stopped, forcing the administrator to restart the resource manually.
Another example of an HA system is described in Server Clusters: Architecture Overview For Windows Server 2003 (published by Microsoft Corporation, March, 2003). The approach taken by this system generally includes: when a service/resource fails, a manual “Move” operation has to be done by the Cluster administrator. Specifically, if a resource fails, a Failover Manager might restart the resource, or take the resource offline along with its dependent resources. If it takes the resource offline, it will indicate that the ownership of the resource should be moved to another node and be restarted under ownership of the new node. Enhanced logic for node failover may be used in a cluster with three or more nodes. Enhanced failover includes doing a manual “Move Group” operation in Cluster Administrator.
Another example of an HA system is the VERITAS™ Cluster Server from Symantec®. The approach taken by this system generally includes: when a resource fails, no restart of the resource is attempted at all. Instead, the resource is moved to another server for any kind of resource failure.
Another example of an HA system is the TruCluster Server Version 5.1B by Hewlett Packard®. The approach taken by this system generally includes: when a resource fails, the resource is restarted only a specified number of times. After that, a relocation attempt is made. If the resource cannot be relocated, then the resource is just stopped.
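The restart-then-relocate pattern that recurs in the approaches above can be summarized as a small decision function. This is a simplified sketch, not the actual logic of any of the named products; the `Action` enum, the `next_action` function, and its parameters are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RESTART = "restart"
    RELOCATE = "relocate"
    STOP = "stop"


def next_action(restarts_used: int, max_restarts: int,
                can_relocate: bool) -> Action:
    """Choose what to do after a resource failure.

    Restart while restarts remain, then try relocation to another node,
    and stop the resource only if neither option is available.
    """
    if restarts_used < max_restarts:
        return Action.RESTART
    if can_relocate:
        return Action.RELOCATE
    return Action.STOP


print(next_action(restarts_used=1, max_restarts=3, can_relocate=True).value)
print(next_action(restarts_used=3, max_restarts=3, can_relocate=True).value)
print(next_action(restarts_used=3, max_restarts=3, can_relocate=False).value)
```

Under this simplified model, setting `max_restarts` to 0 roughly corresponds to an approach that never restarts in place and relocates on any failure, while a positive value corresponds to an approach that restarts a limited number of times before relocating.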
Based on the foregoing, it is desirable to provide an HA system that handles the restart of resources more efficiently than the approaches employed by currently available HA systems.