“Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a user, the nodes in a cluster appear collectively as a single computer, or entity.
Clustering is often used in relatively large multi-user computer systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
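By way of illustration only, the load balancing and fault tolerance described above may be sketched as follows. The class name Cluster, its methods, and the least-loaded assignment policy are assumptions introduced for this sketch; no particular balancing policy is prescribed here.

```python
class Cluster:
    """Toy cluster: each task goes to the least-loaded live node (illustrative only)."""

    def __init__(self, node_names):
        # Maps each live node name to the list of tasks assigned to it.
        self.tasks = {name: [] for name in node_names}

    def least_loaded(self):
        # Pick the node with the fewest tasks; ties are broken by name.
        return min(self.tasks, key=lambda n: (len(self.tasks[n]), n))

    def submit(self, task):
        node = self.least_loaded()
        self.tasks[node].append(task)
        return node

    def fail(self, node):
        # Remove the failed node and hand its tasks to the surviving nodes.
        orphaned = self.tasks.pop(node)
        for task in orphaned:
            self.submit(task)


cluster = Cluster(["a", "b", "c"])
for i in range(6):
    cluster.submit(f"t{i}")
cluster.fail("b")  # tasks previously on "b" are now handled by "a" and "c"
```

In this sketch, the work of a failed node is redistributed with the same policy used for new work, so the surviving nodes remain evenly loaded.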
Clusters typically handle computer tasks through the performance of “jobs” or “processes” within individual nodes. In some instances, jobs being performed by different nodes cooperate with one another to handle a computer task. Such cooperative jobs are typically capable of communicating with one another, and are typically managed in a cluster using a logical entity known as a “group.” A group is typically assigned some form of identifier, and each job in the group is tagged with that identifier to indicate its membership in the group.
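As a minimal sketch of the group concept described above, each job may carry the identifier of the group to which it belongs. The names Job, Group, add, and nodes, as well as the example identifiers, are hypothetical and are chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Job:
    name: str
    node: str
    group_id: str  # tag indicating the group to which this job belongs


@dataclass
class Group:
    group_id: str
    members: set = field(default_factory=set)

    def add(self, name, node):
        # Tag the new job with this group's identifier and record it as a member.
        job = Job(name, node, self.group_id)
        self.members.add(job)
        return job

    def nodes(self):
        # Nodes on which at least one member job of this group resides.
        return sorted({job.node for job in self.members})


group = Group("G1")
group.add("db-writer", "node1")
group.add("db-writer", "node2")
```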
Member jobs in a group typically communicate with one another using an ordered message-based scheme, where the specific ordering of messages sent between group members is maintained so that every member sees messages sent by other members in the same order as every other member, thus ensuring synchronization between nodes. Requests for operations to be performed by the members of a group are often referred to as “protocols,” and it is typically through the use of one or more protocols that tasks are cooperatively performed by the members of a group. One example of a protocol utilized by many clusters is a membership change protocol, which permits member jobs to be added to or removed from a group. Another example of a protocol is a node start protocol, which enables new nodes to be added to a cluster.
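The ordering property described above may be illustrated with the following sketch, which assumes a single sequencer that stamps every message with a global sequence number. The sequencer is only one possible way of obtaining a consistent order and is an assumption of this sketch; the names Sequencer, Member, stamp, and receive are likewise hypothetical.

```python
import itertools


class Sequencer:
    """Assigns one global sequence number to every group message."""

    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, payload):
        return (next(self._counter), payload)


class Member:
    """Delivers messages strictly in sequence order, buffering early arrivals."""

    def __init__(self, name):
        self.name = name
        self.next_seq = 1    # sequence number expected next
        self.pending = {}    # messages received ahead of their turn
        self.delivered = []  # payloads in the order they were delivered

    def receive(self, message):
        seq, payload = message
        self.pending[seq] = payload
        # Deliver every message that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
messages = [sequencer.stamp(p) for p in ["join A", "update X", "leave B"]]
a, b = Member("a"), Member("b")
for m in messages:
    a.receive(m)
for m in reversed(messages):  # member b receives the same messages out of order
    b.receive(m)
```

Although the two members receive the messages in different network orders, both deliver them in the same order, which is the property relied upon by protocols such as a membership change.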
Clustered computer systems place a high premium on maximizing system availability. As such, automated error detection and recovery are extremely desirable attributes in such systems. One potential source of errors is a node failure, which ultimately requires that the node be expelled from the cluster before the node can resume clustering. For example, in many clustered computer systems, individual nodes rely on an underlying clustering infrastructure, often referred to as clustering resource services. Due to various error conditions, such as the failure of a cluster-critical job, or a failure within the clustering infrastructure, the infrastructure may need to be re-initialized to permit the node to reregister with the other nodes in a cluster.
In most instances, it would be extremely desirable to automatically recover from a node failure and reconnect the node to the cluster. In some instances, a node may lose communication with other nodes in a cluster, in which case extraordinary measures may be required to reconnect the node to the cluster. However, in other instances, a failure on a node (e.g., a failure in a cluster-critical job) may not immediately affect communications of that node with other nodes in a cluster. In these latter types of failures, a node may lose cluster registration, and may appear to the other nodes in the cluster to be dead. Nonetheless, the node may be functional and alive, but incapable of participating in a cluster. In such instances, it is often desirable to “restart” the node to reintroduce the node to the cluster and re-establish clustering on the node.
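The two types of failure distinguished above may be summarized with the following sketch. The enumeration NodeState, the function classify, and its two boolean inputs are hypothetical simplifications introduced for illustration; an actual system would derive such conditions from its own clustering infrastructure.

```python
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    UNREACHABLE = "unreachable"                    # communication has been lost
    ALIVE_UNREGISTERED = "alive but unregistered"  # candidate for a restart


def classify(can_communicate: bool, is_registered: bool) -> NodeState:
    # A node that cannot communicate may require extraordinary measures.
    if not can_communicate:
        return NodeState.UNREACHABLE
    # A reachable node that lost cluster registration appears dead to its
    # peers even though it is functional; restarting it may be appropriate.
    if not is_registered:
        return NodeState.ALIVE_UNREGISTERED
    return NodeState.HEALTHY
```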
As an example, a cluster-wide monitoring job may be used in the various nodes in a cluster to monitor the activities of other member jobs executing on the cluster. If such a monitoring job fails on a node, the node must end, since nothing remains to perform the monitoring on that node. Restarting just the monitor may not be sufficient because, while the monitor was down, other jobs that the monitor was supposed to monitor may also have gone down. It would also be complicated for a restarted monitor to ascertain what may have happened while the monitor was down.
Conventionally, resolution of the failure of a cluster-critical job requires that the node leave the cluster, and then be restarted to add the node back into the cluster in much the same manner as a node is initially added to a cluster. Typically, the restart of a node is initiated via a manual operation by an administrator or operator, or via an automated script executing on the node. A manual operation necessarily requires human intervention, and thus is prone to human error, and also reduces system availability while an administrator manually restarts the node.
An automated script running on a failed node is also problematic, since a failed node may be incapable of re-joining a cluster after the node has failed. In particular, a failing node may not be capable of determining what caused its failure. Moreover, if the reason for failure is the loss of clustering information required to join with a cluster (e.g., cluster membership data), the node may not be capable of determining how to join with an existing cluster. Furthermore, if the failure that required the node to be restarted could not be remedied through a simple restart procedure, a potential exists that an automated script would lock up while attempting to continually restart the node without success.
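The lock-up hazard noted above may be illustrated with the following sketch, in which a restart loop is given an upper bound on the number of attempts. The bound is a common mitigation assumed here purely for illustration and is not asserted to be the approach of any particular system; the function restart_with_limit and its parameters are hypothetical.

```python
def restart_with_limit(try_restart, max_attempts=3):
    """Call try_restart() until it succeeds or max_attempts is reached.

    Returns a tuple (succeeded, attempts_made). Without the bound, a failure
    that a restart cannot remedy would cause the loop to retry indefinitely.
    """
    for attempt in range(1, max_attempts + 1):
        if try_restart():
            return True, attempt
    return False, max_attempts


# A restart that can never succeed is abandoned after the allowed attempts.
result = restart_with_limit(lambda: False, max_attempts=5)
```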
Therefore, a significant need exists in the art for a manner of automating the process of detecting the need for, and initiating, the restart of a node in a clustered computer system, in particular, to increase system availability and reduce operator intervention.