1. Field of the Invention
This invention relates to computer systems, and more particularly to detecting and recovering from node failures in a cluster environment.
2. Description of the Related Art
Distributed applications are often implemented as part of commercial and non-commercial business solutions for an enterprise. For example, a company may use an enterprise application that includes various databases distributed across multiple computers. Applications of this type, which support E-commerce, typically support hundreds or thousands of simultaneous sessions during periods of peak utilization. For scalability and fault tolerance, the servers running such applications may be clustered.
FIG. 1 illustrates a networked computer system including a cluster 100, according to prior art. Clients 110 may be coupled to cluster 100 through network 120. Clients 110 may initiate sessions with application components running on nodes 140. Load balancer 130 may distribute session requests from clients 110 to nodes 140 to “balance” the total workload among the servers. In some cases, load balancing may amount to nothing more than round-robin assignment of new sessions to cluster members. In other cases, load balancer 130 may have access to data concerning the current workload of each node 140. When a new session request is received, load balancer 130 may use this data to determine which node has the “lightest” workload and assign the new session to that node. Regardless of the distribution algorithm used by the load balancer 130, the capacity of the application component(s) running on the nodes 140 of the cluster is greater than if it were limited to only a single node, and most architectures for cluster 100 include scalability to allow for increasing capacity by adding nodes 140 to the cluster.
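By way of illustration only, the two distribution approaches described above might be sketched as follows. The class names, method names, and the use of an active-session count as the measure of workload are assumptions made for this example and do not appear in FIG. 1; an actual load balancer 130 may measure workload quite differently.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer: assigns new sessions to nodes in fixed rotation."""

    def __init__(self, nodes):
        self._rotation = cycle(list(nodes))

    def assign(self):
        # Workload is ignored; each call simply returns the next node in turn.
        return next(self._rotation)


class LeastLoadedBalancer:
    """Hypothetical balancer: assigns each new session to the lightest node.

    Assumption: workload is approximated by the number of active sessions.
    """

    def __init__(self, nodes):
        self.load = {node: 0 for node in nodes}

    def assign(self):
        # Pick the node with the fewest active sessions (first one wins ties).
        node = min(self.load, key=self.load.get)
        self.load[node] += 1
        return node

    def release(self, node):
        # Called when a session on the given node ends.
        self.load[node] -= 1
```

In this sketch, the round-robin variant needs no knowledge of the nodes beyond their identity, whereas the least-loaded variant depends on the balancer being kept informed of session starts and completions.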
Another desirable characteristic of application component(s) executing on a server cluster is high availability. For an application component running in a non-clustered environment, the failure of its server makes the component unavailable until the server is repaired or replaced. This loss of service may be very undesirable for an enterprise, particularly if the function being performed by the application component is, for example, registering orders for products or services. If the application component is executing on a cluster, one or more nodes 140 within the cluster can fail, and the application may continue to provide service on the remaining active servers, although at a reduced capacity. This attribute of a clustered server environment is called “failover”, and it can be implemented in a variety of ways. In some cases, the load balancer 130 may determine that a given node 140 has failed and simply not assign any further work to that node. This ensures that new requests will receive service, but does nothing for work that was in process on the failed server.
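The simplest form of failover described above, in which the load balancer merely stops assigning work to a failed node, might be sketched as follows. This is an illustrative assumption only: the `FailoverBalancer` name and its methods are hypothetical, and how the balancer learns of a failure is left outside the sketch.

```python
class FailoverBalancer:
    """Hypothetical round-robin balancer that skips nodes marked as failed."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.failed = set()
        self._next = 0

    def mark_failed(self, node):
        # Assumption: some external mechanism reports the failure.
        self.failed.add(node)

    def assign(self):
        # Try each node at most once, starting from the rotation position.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node not in self.failed:
                return node
        raise RuntimeError("no active nodes remain in the cluster")
```

Note that nothing in this sketch addresses sessions already in process on the failed node; only new assignments are redirected, and the cluster continues at reduced capacity.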
Many cluster architectures have been formulated to address the need for graceful failover of cluster members to attempt to minimize the impact of server failure on end users. For a failover to be truly graceful, it should be completely transparent to the client. In most cases, graceful failover within a cluster requires the nodes 140 to be “cluster-aware” to the point of being able to detect the failure of fellow cluster members, and in some cases each server needs to be able to resume the processing of jobs that were executing on the failed server at the time it failed. The increase in complexity for each node 140 to support this level of graceful failover may be quite large in terms of the design, verification, and maintenance of the additional functionality.
It is common in clustered systems running distributed applications that once a client 110 has established a session with a particular instance of an application component, the load balancer 130 will thenceforth direct client requests associated with that session to the node 140 running that instance of the application component. This may facilitate the maintenance of session state data coherency, but can create problems for the client 110 when that particular node 140 fails. Since the client 110 makes service requests over a connection that has the failed node 140 as an endpoint, failure of this node may result in communications failure from the perspective of the client. Also, state data associated with that particular session may become unavailable with the loss of the node 140.
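The session affinity described above, and the problem it creates on node failure, might be illustrated with the following minimal sketch. All names are hypothetical, and the sketch assumes that session state is held only in the memory of the node serving the session, which is one common arrangement rather than a requirement.

```python
class StickyBalancer:
    """Hypothetical balancer that pins each session to a single node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.sessions = {}                        # session id -> pinned node
        self.state = {n: {} for n in self.nodes}  # per-node session state
        self._next = 0

    def handle(self, session_id, key, value):
        # A new session is pinned to the next node in rotation; every later
        # request for that session goes to the same node.
        if session_id not in self.sessions:
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            self.sessions[session_id] = node
        node = self.sessions[session_id]
        if node not in self.state:
            # The pinned node is gone: the client sees a communications
            # failure, and the session's state is no longer reachable.
            raise ConnectionError(f"node {node} is unavailable")
        self.state[node].setdefault(session_id, {})[key] = value
        return node

    def fail(self, node):
        # Simulate node failure: its in-memory session state is lost.
        self.state.pop(node, None)
```

In this sketch, a session pinned to a node that later fails raises an error on its next request, while sessions pinned to surviving nodes are unaffected.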