As is known in the art, a computer network cluster is a collection of interconnected computers which share resources such as data storage. The individual computers, or nodes, are connected through both a physical and a software-level interconnect. The independent nodes are integrated into a single virtual computer, appearing to an end user as a single computing resource. If one node fails, the remaining nodes will handle the load previously handled by the failed node. This multiple computer environment provides many benefits to a user including high availability and increased speed of operation.
A typical network cluster configuration includes a plurality of nodes typically sharing one or more storage devices. The nodes are connected to each other by a high-speed network connection such as Ethernet.
A user can connect into the network cluster through any of the nodes in the network cluster. From the perspective of a user, the network cluster appears as a single computer system. Software applications run by a user are executed using the shared storage devices. An exemplary software application often executed on a computer network cluster is a database application. Typically, the database is stored on one or more shared storage devices. Inquiries or changes to the database are initiated by a user through any one of the cluster member nodes.
Successful operation of a network cluster requires coordination among the nodes with respect to usage of the shared resources as well as with respect to the communication between the nodes. Specifically, with multiple users manipulating shared data, precautions must be taken in a network cluster to ensure the data is not corrupted. In addition, instances of nodes joining and exiting the network cluster must also be coordinated to avoid a loss of system integrity. Multiple safeguards have been instituted to aid in the prevention of a loss of system integrity.
One such safeguard may be instituted by the network cluster to handle cluster partitioning. Cluster partitioning results when the cluster network degenerates into multiple cluster partitions, each including a subset of the cluster network nodes and each operating independently of the others. These partitions may be the result of miscommunication resulting from nodes joining or exiting the network cluster, the so-called partition-in-time problem.
The partition-in-time problem occurs when a node is absent from an operating network cluster for a period of time and, as a result, the node has an invalid description of the operating parameters of the network cluster. For example, suppose a network cluster is operating under a parameter where each node is scheduled to send a heartbeat message every second. A node, previously a member of the network cluster and currently rejoining the cluster, expects to send its heartbeat messages every three seconds, as that had been the heartbeat messaging time interval when it was last a member of the cluster. As a result, the rejoining node sends its heartbeat message every three seconds rather than every second. Accordingly, the network cluster continually attempts to resolve membership of the cluster, because the remaining nodes, not receiving a heartbeat message from the rejoining node every second, assume the network connection to that node was lost. While the network cluster continually resolves its membership, user applications are unnecessarily stalled and valuable system resources are wasted.
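The heartbeat mismatch in the example above can be illustrated with a minimal sketch. The function name `missed_deadlines`, the use of whole-second time steps, and the assumption of zero network delay are all hypothetical simplifications introduced here for illustration; they are not taken from any particular cluster implementation.

```python
def missed_deadlines(sender_interval, expected_interval, duration):
    """Count monitoring windows in which no heartbeat arrived.

    All arguments are whole seconds. This is a simplified model that
    assumes heartbeats arrive instantly (no network delay or jitter).
    """
    # Times at which the rejoining node actually sends a heartbeat.
    arrivals = set(range(sender_interval, duration + 1, sender_interval))

    missed = 0
    # The remaining nodes expect one heartbeat in every window of
    # length expected_interval.
    for window_end in range(expected_interval, duration + 1, expected_interval):
        window_start = window_end - expected_interval
        if not any(window_start < t <= window_end for t in arrivals):
            missed += 1  # the cluster would suspect a lost connection here
    return missed


# Rejoining node uses the stale 3-second interval; cluster expects 1 second.
stale = missed_deadlines(sender_interval=3, expected_interval=1, duration=12)
# Rejoining node uses the current 1-second interval.
current = missed_deadlines(sender_interval=1, expected_interval=1, duration=12)
```

Under these assumptions, the stale interval leaves two out of every three one-second windows without a heartbeat, each of which could trigger a membership resolution, whereas the matching interval leaves none.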
To resolve such a problem, metadata that defines the cluster is typically generated and stored. The information typically included is the identities of the nodes which have permission to join the network cluster, the identities of the nodes which are currently operating in the network cluster, a time interval for sending heartbeat messages and the like. This metadata is provided to each member of the network cluster upon joining the cluster. In this manner a joining node is made aware of the current operating parameters of the cluster upon joining.
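One way such metadata might be represented and handed to a joining node is sketched below. The names `ClusterMetadata`, `Node`, and `join`, as well as the field names, are hypothetical and chosen only to mirror the items listed above; an actual cluster would also persist and distribute the metadata, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMetadata:
    """Hypothetical record of the information that defines the cluster."""
    permitted_nodes: frozenset   # nodes allowed to join
    active_nodes: frozenset      # nodes currently operating in the cluster
    heartbeat_interval_s: int    # current heartbeat interval, in seconds


class Node:
    def __init__(self, name, heartbeat_interval_s):
        self.name = name
        # May be stale if the node was absent from the cluster for a while.
        self.heartbeat_interval_s = heartbeat_interval_s


def join(node, metadata):
    """Admit a node and give it the cluster's current operating parameters."""
    if node.name not in metadata.permitted_nodes:
        raise PermissionError(f"{node.name} is not permitted to join")
    # Overwrite the node's possibly stale parameter with the current one.
    node.heartbeat_interval_s = metadata.heartbeat_interval_s
    # Return updated metadata that lists the node as active.
    return ClusterMetadata(
        permitted_nodes=metadata.permitted_nodes,
        active_nodes=metadata.active_nodes | {node.name},
        heartbeat_interval_s=metadata.heartbeat_interval_s,
    )


metadata = ClusterMetadata(
    permitted_nodes=frozenset({"a", "b", "c"}),
    active_nodes=frozenset({"a", "b"}),
    heartbeat_interval_s=1,
)
rejoining = Node("c", heartbeat_interval_s=3)  # remembers the old interval
updated = join(rejoining, metadata)
```

In this sketch the rejoining node discards its remembered three-second interval in favor of the one-second interval recorded in the metadata, which is the behavior the paragraph above describes.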