Computer clusters are an increasingly popular alternative to more traditional computer architectures. A computer cluster is a collection of individual computers (known as nodes) that are interconnected to provide a single computing system. The use of a collection of nodes has a number of advantages over more traditional computer architectures. One easily appreciated advantage is the fact that nodes within a computer cluster may fail individually. As a result, in the event of a node failure, the majority of nodes within a computer cluster survive in an operational state. This has made the use of computer clusters especially popular in environments where continuous availability is required.
Single system image (SSI) clusters are a special type of computer cluster. SSI clusters are configured to provide programs (and programmers) with a unified environment in which the individual nodes cooperate to present a single computer system. Resources, such as filesystems, are made transparently available to all of the nodes included in an SSI cluster. As a result, programs in SSI clusters are provided with the same execution environment regardless of their physical location within the computer cluster. SSI clusters increase the effectiveness of computer clusters by allowing programs (and programmers) to ignore many of the details of cluster operation. Compared to other types of computer clusters, SSI clusters offer superior scalability (the ability to incrementally increase the power of the computing system) and manageability (the ability to easily configure and control the computing system). At the same time, SSI clusters retain the high availability of more traditional computer cluster types.
The ability of computer clusters to survive node failure does not mean that these failures have no effect. Instead, it is generally the case that each node failure will have a number of undesirable consequences. One of these consequences is the termination of each process executing at a failed node. Loss of a node may also have a much wider effect. For example, in some cluster types, failure of a node will result in termination of all processes that originated at, and later migrated from, the failed node.
Another undesirable consequence of node failure is severance of process relationships. Process relationships are a way in which different processes interact. In UNIX® and UNIX-like environments, process relationships include parent-child relationships, process groups, and sessions. In general, failure of any node within a computer cluster will result in the loss of a number of processes, severing a number of process relationships. Each severed relationship must be rebuilt or cleaned up following node failure.
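As a concrete illustration of the process relationships named above, the following minimal sketch queries the parent, process group, and session of the calling process on a single UNIX-like system. The helper name `describe_relationships` is hypothetical and introduced only for illustration; the `os` calls are standard Python wrappers around the corresponding system calls. It shows only what the relationships are on one machine, not how a cluster tracks them across nodes.

```python
import os


def describe_relationships() -> dict:
    """Return the relationship identifiers of the calling process.

    Hypothetical helper: it only reads the identifiers that a
    UNIX-like kernel keeps for every process.
    """
    return {
        "pid": os.getpid(),              # the process itself
        "parent_pid": os.getppid(),      # parent-child relationship
        "process_group": os.getpgid(0),  # process group (0 = caller)
        "session": os.getsid(0),         # session (0 = caller)
    }


if __name__ == "__main__":
    for name, value in describe_relationships().items():
        print(f"{name}: {value}")
```

If the parent, a process group member, or the session leader resided on a failed node, the corresponding identifier would refer to a process that no longer exists, which is the kind of severed relationship that must be rebuilt or cleaned up.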
The undesirable consequences of node failure make effective failure recovery an indispensable component of computer clusters. To be effective, failure recovery must minimize the number of processes terminated during a node failure. Thus, it is important to provide a mechanism for preserving processes that have migrated from a failed node. Effective failure recovery must also provide a mechanism for rebuilding severed process relationships.
One potential strategy for performing failure recovery is to have the nodes included in a computer cluster perform global reconciliation following each node failure. During global reconciliation, the nodes exchange a series of messages. The messages allow the nodes to determine the effect of the node failure on the processes surviving within the computer cluster. The nodes may then take action, such as rebuilding process relationships, to minimize the effect of the node failure. Unfortunately, practice has shown global reconciliation to be extremely message intensive. Worse, the number of messages required tends to grow geometrically with the number of nodes included in the computer cluster. This limits the use of global reconciliation to computer clusters that include only a small number of nodes.
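To give a rough sense of why message volume limits global reconciliation, the sketch below counts messages under one assumed exchange pattern: in each round, every surviving node sends one message to every other surviving node. The text above does not specify the actual message pattern, so the all-to-all pattern, the function name `reconciliation_messages`, and the `rounds` parameter are assumptions made purely for illustration.

```python
def reconciliation_messages(surviving_nodes: int, rounds: int = 1) -> int:
    """Count messages for an assumed all-to-all exchange.

    Assumption (not from the source): in each round every surviving
    node sends exactly one message to every other surviving node.
    """
    if surviving_nodes < 0 or rounds < 0:
        raise ValueError("node and round counts must be non-negative")
    return rounds * surviving_nodes * max(surviving_nodes - 1, 0)


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        print(f"{n} nodes: {reconciliation_messages(n)} messages per round")
```

Even under this simple assumption, doubling the number of nodes roughly quadruples the message count, so the recovery cost rises much faster than the cluster size.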
A second potential strategy for performing failure recovery is to have each process included in a process relationship reliably track the location of all other processes included in the relationship. This allows reconciliation to be performed with a minimal number of inter-node messages. Unfortunately, this type of tracking involves a substantial runtime penalty each time a process migrates between nodes within a computer cluster. This runtime penalty makes the use of this tracking strategy impractical within most computer clusters.
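The trade-off in this tracking strategy can be sketched as follows. Every name here (`TrackedProcess`, `link`, `migrate`, `peers_lost`) is hypothetical, and the in-memory dictionaries stand in for state that a real cluster would have to update with messages between nodes. The sketch assumes each member of a relationship keeps a table of its peers' locations.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedProcess:
    pid: int
    node: str
    # Location table: peer pid -> node where that peer currently runs.
    peer_locations: dict = field(default_factory=dict)


def link(group: list) -> None:
    """Give every member a full table of its peers' current locations."""
    for proc in group:
        proc.peer_locations = {p.pid: p.node for p in group if p is not proc}


def migrate(proc: TrackedProcess, new_node: str, group: list) -> int:
    """Move proc and update every peer's table; return the update count."""
    proc.node = new_node
    updates = 0
    for peer in group:
        if peer is not proc:
            peer.peer_locations[proc.pid] = new_node  # one update per peer
            updates += 1
    return updates


def peers_lost(proc: TrackedProcess, failed_node: str) -> list:
    """Find lost peers from the local table alone, with no messages."""
    return [pid for pid, node in proc.peer_locations.items()
            if node == failed_node]
```

In this sketch, recovery needs only a local lookup (`peers_lost`), but every call to `migrate` must touch every other member of the relationship, which corresponds to the runtime penalty paid during normal operation.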
Based on the preceding paragraphs, it is clear that there is a need for failure recovery techniques for computer clusters. These failure recovery techniques must minimize the number of processes terminated by node failure and reconstruct process relationships severed by node failure. Failure recovery must also be performed with a limited number of inter-node messages and without incurring a substantial runtime penalty during normal operation of the computer cluster.