1. Field of the Invention
The present invention generally relates to the field of clustered computer systems, and in particular to a method for reducing the downtime of a clustered application using user-defined rules.
2. Description of Related Art
A cluster is a group of computers that work together to run a common set of applications and that appear as a single system to clients and applications. The computers are physically connected by cables and programmatically connected by cluster software. These connections allow the computers to use failover and load balancing, neither of which is possible with a stand-alone computer.
Clustering, provided by the cluster software, offers high availability for mission-critical applications such as databases, messaging systems, and file and print services. High availability means that the cluster is designed to avoid a single point of failure. Applications can be distributed over more than one computer, achieving a degree of parallelism and failure recovery, and thereby higher availability. The nodes in a cluster remain in constant communication. If one of the nodes becomes unavailable as a result of failure or maintenance, another node takes over the failing node's workload and begins providing service. This process is known as failover. With very high availability, users who were accessing the service should be able to continue to access it, and should be unaware that the service is being provided by a different node.
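The failover behavior described above can be illustrated with a minimal sketch. The `Node` class, the `fail_over` function, the node and workload names, and the choice of the least-loaded survivor as the takeover target are all hypothetical and are used only for illustration; actual cluster software may select the takeover node differently.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node holding a list of workload names."""
    name: str
    alive: bool = True
    workloads: list = field(default_factory=list)


def fail_over(nodes, failed_name):
    """Move the failed node's workloads to a surviving node and return it."""
    failed = next(n for n in nodes if n.name == failed_name)
    failed.alive = False
    survivors = [n for n in nodes if n.alive]
    if not survivors:
        raise RuntimeError("no surviving node available for failover")
    # Assumption: the least-loaded survivor takes over the workload.
    target = min(survivors, key=lambda n: len(n.workloads))
    target.workloads.extend(failed.workloads)
    failed.workloads = []
    return target


nodes = [
    Node("node-a", workloads=["database"]),
    Node("node-b", workloads=["file-service"]),
]
target = fail_over(nodes, "node-a")
print(target.name, target.workloads)  # node-b ['file-service', 'database']
```

The sketch models only the reassignment of workload; it does not model the time spent on recovery, which is the concern of the paragraphs that follow.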
In general, mission-critical applications of an enterprise may have an availability requirement as high as 99.999%, which translates to roughly 5 minutes (about 5.26 minutes) of downtime in a year. The failover feature is usually provided in clusters that run such mission-critical applications.
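The relationship between an availability percentage and the yearly downtime it permits can be checked with a short calculation. The function name `allowed_downtime_minutes` and the use of a 365-day year are assumptions made for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


budget = allowed_downtime_minutes(99.999)
print(f"{budget:.2f} minutes per year")  # 5.26 minutes per year
```

Under this calculation, each additional "nine" of availability reduces the permitted downtime by a factor of ten, so 99.99% allows about 52.6 minutes per year and 99.9% allows about 525.6 minutes.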
However, the failover process itself can be time-consuming in situations where the failover is triggered after the application fails or terminates abnormally. The process of failover involves various stages (FIG. 1B), of which application data recovery can take a substantial amount of time, in some cases as long as 15 to 30 minutes.
The time needed to transfer the resources from a failed node to a surviving node and to resume operation there is determined by a variety of factors (such as the amount of data and the state of the data when the failure happened), and ranges from a few seconds to 30 minutes or even longer in some situations. A downtime of several minutes would be unacceptable for mission-critical applications.
Thus, there is a need for a method that reduces application downtime in a cluster and that also provides user control at the cluster level.