1. Field of the Invention
This invention relates to high availability (HA) computer clusters. More particularly, the invention relates to a system and method for detecting state changes of resources in an HA cluster.
2. Description of the Related Art
High-availability clusters, also known as HA clusters, are computer clusters that are implemented primarily for the purpose of improving the availability of resources that the cluster provides. They operate by having redundant computers or nodes which are used to provide service when resources fail. For example, normally if a process fails then the process will remain unavailable until an administrator re-starts it. HA clustering remedies this situation by detecting when a process or other resource has failed and automatically re-starting the process or resource, e.g., on another node, without requiring administrative intervention. HA clusters are often used to ensure that resources such as databases, file systems, network addresses, or other resources remain available for applications which require a high degree of dependability, such as electronic commerce websites or other business applications.
In typical HA cluster implementations, the software that ensures high availability of resources periodically polls the state of the resources, e.g., by actively checking to determine whether the resources are functioning properly or not. For example, in some systems the software uses a user-configurable time period to schedule the periodic polling of all the managed resources. In many implementations, all of the resources are polled at the same time. One disadvantage of this approach is that polling all of the resources at the same time can cause a large spike in the system load, which can lower the responsiveness of the system. This can even lead to spurious failovers when the load created by the monitoring mechanism interferes with the monitoring itself.
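The load spike described above can be illustrated with a minimal sketch. It is not taken from any particular HA product; the function name `simultaneous_poll_load` and the notion of a discrete "tick" are assumptions introduced only for illustration. The sketch counts how many resource checks are issued at each tick when every resource shares a single polling schedule.

```python
from typing import List


def simultaneous_poll_load(num_resources: int, interval: int, ticks: int) -> List[int]:
    """Return the number of checks issued at each tick (hypothetical model).

    All resources are polled together at ticks 0, interval, 2*interval, ...
    and no checks are issued at any other tick.
    """
    return [num_resources if t % interval == 0 else 0 for t in range(ticks)]


# Four resources polled every three ticks: the load arrives in bursts of four.
load = simultaneous_poll_load(num_resources=4, interval=3, ticks=7)
print(load)
```

In this model the system is idle between polls and then performs every check at once, which corresponds to the spike in system load noted above.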
Another problem with this method relates to the monitoring of offline resources. In some cases, HA clusters may need to ensure that certain resources are online on only one node. To prevent resources from being online on more than one node, resource states have to be monitored on nodes on which they are expected to be offline. Typically, offline monitoring is more expensive than online monitoring because it involves a complete scan of a system-level data structure in order to ascertain the absence of the resource. For example, for offline process monitoring, the entire process table may need to be scanned to make sure that the process is not online. Similarly, for a file system mount point, the entire mount table file may be scanned for the same purpose. This causes increased load on the system.
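The cost asymmetry between confirming presence and confirming absence can be sketched as follows. The process table here is simulated as a plain list of names, and the function `scan_for_process` is a hypothetical helper rather than a real operating system interface. The sketch reports how many entries had to be examined before an answer was available.

```python
from typing import List, Tuple


def scan_for_process(process_table: List[str], name: str) -> Tuple[bool, int]:
    """Search a simulated process table; return (found, entries_examined).

    A match allows the scan to stop early, but absence can only be
    confirmed after every entry has been examined.
    """
    examined = 0
    for entry in process_table:
        examined += 1
        if entry == name:
            return True, examined
    return False, examined


table = ["init", "sshd", "db", "httpd"]  # illustrative entries only
print(scan_for_process(table, "sshd"))     # stops at the matching entry
print(scan_for_process(table, "missing"))  # must examine the whole table
```

Under this simplified model, verifying that a resource is offline always costs a full pass over the table, whereas verifying that it is online may not.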
Also, since this method relies on periodic polling at scheduled time intervals, instantaneous detection of resource state changes is not possible, which may result in failed resources remaining in a failed state until the failure is discovered at the next scheduled polling. The delay in detection of resource state changes is a function of the monitoring interval. Larger monitoring intervals lead to longer failover times, thereby increasing service down-time.
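The dependence of detection delay on the monitoring interval can be made concrete with a small sketch. It assumes, purely for illustration, that polls occur at times 0, T, 2T, and so on, that a poll detects any failure that has already occurred, and that a failure occurring exactly at a poll time is detected immediately. The function name `detection_delay` is hypothetical.

```python
import math


def detection_delay(failure_time: int, interval: int) -> int:
    """Return the time between a failure and the next scheduled poll.

    Assumes polls at 0, interval, 2*interval, ... (an illustrative model).
    """
    next_poll = math.ceil(failure_time / interval) * interval
    return next_poll - failure_time


# A failure just after a poll waits almost a full interval to be noticed.
print(detection_delay(failure_time=61, interval=60))
print(detection_delay(failure_time=1, interval=300))
```

In this model the worst-case delay approaches one full monitoring interval, so lengthening the interval directly lengthens the time a failed resource may go unnoticed.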
Other HA clusters use a slightly refined version of this method by attempting to stagger the polling of different resources in order to ensure that not all of the resources are polled at the same time. This may help to eliminate large spikes in the system load, but the average system load caused by the resource polling still remains high, and lag time for discovering resource failure is still a problem.
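The effect of staggering can be illustrated with a minimal sketch that compares the two schedules under the same discrete-tick model. The function names and the rule of offsetting each resource by its index are assumptions chosen for illustration, not a description of any specific cluster product.

```python
from typing import List


def simultaneous_schedule(num_resources: int, interval: int, ticks: int) -> List[int]:
    """Checks per tick when all resources are polled together."""
    return [num_resources if t % interval == 0 else 0 for t in range(ticks)]


def staggered_schedule(num_resources: int, interval: int, ticks: int) -> List[int]:
    """Checks per tick when resource r is first polled at tick r % interval."""
    load = [0] * ticks
    for r in range(num_resources):
        offset = r % interval  # hypothetical staggering rule
        for t in range(offset, ticks, interval):
            load[t] += 1
    return load


together = simultaneous_schedule(4, 4, 8)
spread = staggered_schedule(4, 4, 8)
print(max(together), sum(together))  # peak load and total checks
print(max(spread), sum(spread))      # lower peak, same total
```

In this model staggering lowers the peak number of checks per tick, but the total number of checks over the window is unchanged, and each resource is still checked only once per interval, so the detection lag is not reduced.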