“Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a user, the nodes in a cluster appear collectively as a single computer, or entity.
Clustering is often used in relatively large multi-user computer systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Load balancing is often used as well to ensure that tasks are distributed fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
Clusters typically handle computer tasks through the performance of “jobs” or “processes” within individual nodes. In some instances, jobs being performed by different nodes cooperate with one another to handle a computer task. Such cooperative jobs are typically capable of communicating with one another, and are typically managed in a cluster using a logical entity known as a “group.” A group is typically assigned some form of identifier, and each job in the group is tagged with that identifier to indicate its membership in the group.
Member jobs in a group typically communicate with one another using an ordered message-based scheme, where the specific ordering of messages sent between group members is maintained so that every member sees messages sent by other members in the same order as every other member, thus ensuring synchronization between nodes. Requests for operations to be performed by the members of a group are often referred to as “protocols,” and it is typically through the use of one or more protocols that tasks are cooperatively performed by the members of a group.
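By way of illustration only, the ordered messaging described above may be sketched as follows. The sketch assumes a single in-process sequencer that stamps every message with a sequence number before delivering it to all members; the names `Group`, `Member`, `send`, and `deliver` are hypothetical, and an actual clustering infrastructure may achieve the same ordering by other means.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Member:
    """Hypothetical member job that records messages in delivery order."""
    name: str
    delivered: List[Tuple[int, str, str]] = field(default_factory=list)

    def deliver(self, seq: int, sender: str, payload: str) -> None:
        self.delivered.append((seq, sender, payload))


class Group:
    """Hypothetical group whose sequencer imposes one order on all messages."""

    def __init__(self, group_id: str) -> None:
        self.group_id = group_id          # identifier shared by all member jobs
        self.members: List[Member] = []
        self._next_seq = 0

    def join(self, member: Member) -> None:
        self.members.append(member)

    def send(self, sender: Member, payload: str) -> None:
        # Stamp the message once, then deliver it to every member
        # (including the sender) so all members see the same sequence.
        seq = self._next_seq
        self._next_seq += 1
        for member in self.members:
            member.deliver(seq, sender.name, payload)


def demo_ordering():
    group = Group("G1")
    a, b, c = Member("A"), Member("B"), Member("C")
    for m in (a, b, c):
        group.join(m)
    group.send(a, "start protocol 1")
    group.send(b, "start protocol 2")
    return a.delivered, b.delivered, c.delivered
```

In this sketch every member ends up with an identical delivery log, which is the property that keeps the members synchronized.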
Clustered computer systems place a high premium on maximizing system availability. As such, automated error detection and recovery are extremely desirable attributes in such systems. Some errors on a node may only require that individual members terminate, while others may require that an entire node (including all members and the clustering infrastructure) ultimately terminate. To maintain system availability and integrity, the members of each group are often required to know what other members currently exist in their group. Thus, whenever a member or node fails, members on other nodes in a clustered computer system typically must be notified that certain members are leaving their respective groups.
To handle such notification, many clustered computer systems rely on protocols known as membership change protocols to distribute membership change information to the various members of a group. Membership Change Messages (MCMs) are typically used to initiate membership change protocols, and in some systems, reasons for a membership change are incorporated into the messages. Two such types of reasons are “member leave” and “node leave,” with the former indicating that a particular member on a node is leaving a group, and the latter indicating that all members on a node are leaving their respective groups.
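For purposes of illustration, a Membership Change Message carrying a reason may be modeled as shown below. This is a minimal sketch under stated assumptions: each group has at most one member per node, so a member is identified by its node name, and the names `LeaveReason`, `MembershipChangeMessage`, and `apply_mcm` are hypothetical rather than drawn from any particular clustering product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Set


class LeaveReason(Enum):
    MEMBER_LEAVE = "member leave"   # one member on a node leaves one group
    NODE_LEAVE = "node leave"       # all members on a node leave their groups


@dataclass(frozen=True)
class MembershipChangeMessage:
    reason: LeaveReason
    node: str
    group_id: Optional[str] = None  # only used for MEMBER_LEAVE


# Assumed view of membership: group id -> nodes hosting a member of that group.
Membership = Dict[str, Set[str]]


def apply_mcm(membership: Membership, mcm: MembershipChangeMessage) -> None:
    """Update the local membership view in response to one MCM."""
    if mcm.reason is LeaveReason.MEMBER_LEAVE:
        # Only the named group loses the member on that node.
        membership[mcm.group_id].discard(mcm.node)
    else:
        # Every group loses its member on the departing node.
        for nodes in membership.values():
            nodes.discard(mcm.node)
```

A member leave thus touches a single group, whereas a node leave removes the node from every group in one step.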
Node leave membership changes are typically utilized whenever a failure is detected that affects all clustering on a node, typically responsive to detection of a failure by the clustering infrastructure for that node. However, a number of types of failures may be detected by individual members, and conventionally, a member that detects an error or failure will initiate a member leave membership change to unregister itself from its group. Member-detected errors are often tied to attempts by a member job to access a resource, whereby a failure is detected whenever the access attempt is unsuccessful. For example, a member may detect an error when its attempt to access a database or a file system fails.
Through the use of member-initiated membership changes, unregistration of individual members is often haphazard and inefficient, given that each member that accesses a failed resource will not initiate a membership change until an access attempt is made on that resource. Some members may rarely, if ever, access a failed resource, so a substantial amount of time could pass between failure detections by multiple members on a node.
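The delay described above may be illustrated with a small simulation. The sketch assumes that each member detects the failure only at its first access attempt at or after the time of failure; the function name `detection_times` and the example schedule are hypothetical.

```python
from typing import Dict, List, Optional


def detection_times(
    access_schedule: Dict[str, List[int]], failure_time: int
) -> Dict[str, Optional[int]]:
    """Return, per member, the time at which it first notices the failure.

    access_schedule maps a member name to the times at which that member
    attempts to access the resource. A member that never accesses the
    resource after the failure never detects it (None).
    """
    result: Dict[str, Optional[int]] = {}
    for member, times in access_schedule.items():
        after_failure = [t for t in sorted(times) if t >= failure_time]
        result[member] = after_failure[0] if after_failure else None
    return result


# Hypothetical schedule: the resource fails at time 3.
schedule = {"m1": [1, 5, 9], "m2": [2, 40], "m3": []}
times = detection_times(schedule, failure_time=3)
```

Here `m1` initiates its member leave at time 5, `m2` not until time 40, and `m3` never, so the members on the same node leave their groups at widely separated times.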
Some member-detected errors may ultimately require that an entire node be unregistered from a cluster. Nonetheless, rather than performing a node leave membership change, individual member leave membership changes are typically performed one-by-one by members as they individually detect the errors. In addition to the inefficiency of processing the multiple membership changes, data synchronization and other errors can arise that compromise the integrity of a clustered computer system.
As an example, member-detected errors are particularly problematic when dealing with dependent groups. A dependent group relationship may exist between two groups when one group (referred to as a source group) is required to be active for another group (referred to as a target group) to function properly. For example, a dependent group relationship may exist between a database group and a file access group, since the activities of a database system ultimately depend on the ability to retrieve the data in the database from an external file system. Likewise, application and data groups may be related as dependent groups, as an application may not be capable of executing without access to the data on which it depends.
In clustered computer systems where multiple nodes may manage a shared resource such as a database or file system, a failure in a member that accesses the resource often requires a “failover” to occur, which shifts the access point for the resource from the failed member's node to another member on another node. For example, if a database engine on a node fails, control of the access to the database typically must be shifted to another node to maintain availability of the database.
With dependent groups, a failover of a target group cannot occur until after a source group failover has been completed. Thus, for example, in the case of a database/file access dependency, a database group failover cannot occur before a file access group completes its failover.
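The ordering constraint above amounts to a topological ordering of the groups. The following sketch assumes the dependencies are supplied as a mapping from each target group to the set of source groups it depends on, and uses the `graphlib` module of the Python standard library (Python 3.9 or later); the function name `failover_order` and the group names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def failover_order(sources_of: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which groups may fail over.

    sources_of maps a target group to the source groups that must
    complete their failover first. TopologicalSorter treats the mapped
    values as predecessors, so every source precedes its targets.
    """
    return list(TopologicalSorter(sources_of).static_order())


# Hypothetical dependencies taken from the examples in the text.
order = failover_order({
    "database": {"file_access"},
    "application": {"data"},
})
```

Any order returned places `file_access` ahead of `database` and `data` ahead of `application`; the relative order of unrelated groups is not significant.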
In the case of node failures that initiate node leave membership changes, dependent failovers are conventionally handled automatically during processing of the node leave MCM protocols, using proper ordering to ensure that a source group fails over before any target group that depends on it. Moreover, in the event of a node leave, a target group is assured that the source group will also fail over, so it is permissible for a target group to wait for a source group failover to occur once a node leave operation is initiated. This holds even if the target group receives a node leave Membership Change Message before the source group does, which is important in a clustered computer system because communication delays and message ordering rules may cause different orderings of MCM protocols to occur, i.e., it sometimes cannot be assured that a message to initiate a source group failover will be sent before a message to initiate a target group failover. In some clustered computer systems, a target MCM protocol may be held off until after a source MCM protocol is delivered.
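One way to picture the holding of a target MCM protocol is sketched below. The sketch is illustrative only and assumes each target group has a single source group; the class name `NodeLeaveCoordinator` and its methods are hypothetical and do not describe the mechanism of any particular system.

```python
from typing import Dict, List, Set, Tuple


class NodeLeaveCoordinator:
    """Hypothetical helper that defers a target failover until its source is done."""

    def __init__(self, source_of: Dict[str, str]) -> None:
        self.source_of = source_of                 # target group -> source group
        self.done: Set[Tuple[str, str]] = set()    # (group, node) failovers completed
        self.held: List[Tuple[str, str]] = []      # (group, node) node leaves held off
        self.log: List[str] = []                   # order in which groups failed over

    def on_node_leave(self, group: str, node: str) -> None:
        source = self.source_of.get(group)
        if source is not None and (source, node) not in self.done:
            # The source has not failed over yet: hold the target's protocol.
            self.held.append((group, node))
            return
        self._fail_over(group, node)

    def _fail_over(self, group: str, node: str) -> None:
        self.log.append(group)
        self.done.add((group, node))
        # Release any held targets that were waiting on this source.
        ready = [h for h in self.held
                 if self.source_of[h[0]] == group and h[1] == node]
        for held_group, held_node in ready:
            self.held.remove((held_group, held_node))
            self._fail_over(held_group, held_node)


coordinator = NodeLeaveCoordinator({"database": "file_access"})
coordinator.on_node_leave("database", "n1")      # target arrives first: held
held_before = list(coordinator.held)
coordinator.on_node_leave("file_access", "n1")   # source arrives: both proceed
```

Because a node leave guarantees that the source group's message will eventually arrive, the held target is always released. No such guarantee exists for a member leave.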
In the event of a member-detected failure, however, or any other event that initiates a member leave membership change, it is often not possible to ensure that dependent failovers can occur, because a target group cannot be assured that a source group will fail over as well. Thus, waiting for a source to fail over when a target failover needs to be performed could cause the target group to hang, particularly if the source does not detect the error that caused the target group to initiate the membership change. In addition, a risk of data corruption may exist between source members whenever a target failure occurs without an attendant source failover. For this reason, the use of member leave membership changes to initiate failovers in response to detected errors can be inefficient and unreliable.
Therefore, a significant need exists in the art for a manner of further automating the process of detecting and handling errors in a clustered computer system, and in particular, of handling dependent failovers in response to member-detected errors in an automated and efficient manner.