1. Field of the Invention
This invention relates to computer systems, and more particularly to detecting and recovering from failures in a group of systems.
2. Description of the Related Art
Distributed applications are often implemented as part of commercial and non-commercial business solutions for an enterprise. For example, a company may leverage the use of an enterprise application that includes various databases distributed across multiple computers. Applications of this type, which support E-commerce, typically support hundreds or thousands of requests simultaneously during periods of peak utilization. For scalability and fault tolerance, the servers running such applications may be clustered.
FIG. 1 illustrates a networked computer system including a server cluster 100, according to the prior art. Clients 110 may be coupled to cluster 100 through network 120. Clients 110 may initiate sessions with application components running on servers 140. Load balancer 130 may distribute requests from clients 110 to servers 140 to “balance” the total workload among the servers. In some cases, load balancing may amount to nothing more than the round-robin assignment of new requests to cluster members. In other cases, load balancer 130 may have access to data concerning the current workload of each server 140. When a new request is received, load balancer 130 may use this data to determine which server has the “lightest” workload and assign the new request to that server. Regardless of the distribution algorithm used by load balancer 130, the capacity of the application component running on the servers 140 of the cluster is greater than if it were limited to a single server, and most architectures for cluster 100 provide scalability, allowing capacity to be increased by adding additional servers 140 to the cluster.
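The two distribution policies described above may be sketched as follows. This is a minimal illustrative sketch only, not an implementation from the disclosure; the class and server names are hypothetical.

```python
import itertools

class LoadBalancer:
    """Illustrative sketch of load balancer 130 distributing new client
    requests across servers 140 (names here are hypothetical)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)  # round-robin iterator

    def assign_round_robin(self):
        # Simplest policy: hand each new request to the next server in turn.
        return next(self._cycle)

    def assign_least_loaded(self, workloads):
        # Workload-aware policy: pick the server whose reported current
        # workload is "lightest".
        return min(self.servers, key=lambda s: workloads[s])

lb = LoadBalancer(["server-1", "server-2", "server-3"])
print(lb.assign_round_robin())   # server-1
print(lb.assign_round_robin())   # server-2
print(lb.assign_least_loaded({"server-1": 5, "server-2": 2, "server-3": 7}))  # server-2
```

Either policy increases aggregate capacity over a single server, since requests are spread over all cluster members.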
Another desirable characteristic of an application component executing on a server cluster is high availability. For an application component running in a non-clustered environment, the failure of its server makes the component unavailable until the server is repaired or replaced. This loss of service may be very undesirable for an enterprise, particularly if the function being performed by the application component is, for example, registering orders for products or services. If the application component is executing on a cluster, one or more servers 140 within the cluster can fail, and the application may continue to provide service on the remaining active servers, although at a reduced capacity. This attribute of a clustered server environment is called “failover”, and it can be implemented in a variety of ways. In some cases, the load balancer 130 may determine that a given server 140 has failed and simply not assign any further work to that server. This ensures that new requests will receive service, but does nothing for work that was in process on the failed server.
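The simple form of failover described above — the load balancer excluding a failed server from further assignments — might be sketched as below. This is an illustrative assumption, not the patented method; names are hypothetical.

```python
class FailoverBalancer:
    """Sketch of a load balancer that stops routing to failed servers.
    New requests still receive service; in-process work on the failed
    server is not recovered by this mechanism."""

    def __init__(self, servers):
        self.active = set(servers)

    def mark_failed(self, server):
        # On detecting a failure, remove the server from the active set
        # so no further work is assigned to it.
        self.active.discard(server)

    def assign(self):
        if not self.active:
            raise RuntimeError("no active servers remain in the cluster")
        # Any distribution policy over the remaining active servers would
        # do here; min() is used only to make the choice deterministic.
        return min(self.active)

fb = FailoverBalancer(["server-1", "server-2"])
fb.mark_failed("server-1")
print(fb.assign())  # server-2
```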
Many cluster architectures have been formulated to address the need for graceful failover of cluster members, in an attempt to minimize the impact of individual server failures on end users. In most cases, graceful failover within a cluster requires the servers 140 to be “cluster-aware” to the point of being able to detect the failure of fellow cluster members, and in some cases each server may be able to resume the processing of jobs that were executing on the failed server at the time it failed. The increase in complexity required for each server 140 to support this level of graceful failover may be quite large in terms of the design, verification, and maintenance of the additional functionality.
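One common way a “cluster-aware” server can detect the failure of fellow cluster members is by tracking periodic heartbeats from its peers and declaring a peer failed when its heartbeat stops. The following is a minimal sketch under that assumption; the heartbeat mechanism, timeout value, and names are illustrative, not taken from the disclosure.

```python
class HeartbeatMonitor:
    """Sketch of peer-failure detection in a cluster-aware server:
    each server records the last time a heartbeat arrived from each
    peer, and treats a peer as failed once no heartbeat has arrived
    within a timeout window."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        self.last_seen = {p: 0.0 for p in peers}

    def heartbeat(self, peer, now):
        # Record the arrival time of a heartbeat message from a peer.
        self.last_seen[peer] = now

    def failed_peers(self, now):
        # A peer is considered failed if its last heartbeat is older
        # than the timeout.
        return [p for p, t in self.last_seen.items() if now - t > self.timeout]

m = HeartbeatMonitor(["server-2", "server-3"], timeout=5.0)
m.heartbeat("server-2", now=10.0)
m.heartbeat("server-3", now=12.0)
# At t=16.0, server-2 (last seen at 10.0) has exceeded the 5-second timeout.
print(m.failed_peers(now=16.0))  # ['server-2']
```

Resuming the jobs that were in process on a failed peer requires substantially more machinery (shared or replicated job state) than this detection step alone, which is part of the complexity cost noted above.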
Another aspect of cluster management is the reintroduction of recovering or recovered servers into the cluster. From an enterprise point of view, it is desirable to return a server 140 to use as quickly as possible after it has failed. In some instances, a failed server may be recovered by simply performing a server restart on the failed unit. Depending upon the amount of time needed for the cluster to recover from a server failure, the failed unit may be restarted before the cluster has completely recovered from its failure. This type of situation can lead to complex problems in maintaining system consistency.