As is known in the art, a computer network cluster is a collection of interconnected computers which share resources such as data storage. The individual computers, or nodes, are connected through both a physical and a software-level interconnect. The independent nodes are integrated into a single virtual computer, appearing to an end user as a single computing resource. If one node fails, the remaining nodes will handle the load previously handled by the failed node. This multiple computer environment provides many benefits to a user including high availability and increased speed of operation.
A typical network cluster configuration includes a plurality of nodes sharing one or more storage devices. The nodes are connected to each other by a high speed network connection such as Ethernet.
A user can connect into the network cluster through any of the nodes in the network cluster. From the perspective of a user, the network cluster appears as a single computer system. Software applications run by a user are executed using the shared storage devices. An exemplary software application often executed on a computer network cluster is a database application. Typically, the database is stored on one or more shared storage devices. Inquiries or changes to the database are initiated by a user through any one of the cluster member nodes.
Successful operation of a network cluster requires coordination among the nodes with respect to usage of the shared resources as well as with respect to the communication between the nodes. Specifically, with multiple users manipulating shared data, precautions must be taken in a network cluster to ensure the data is not corrupted. In addition, instances of nodes joining and exiting the network cluster must also be coordinated to avoid a loss of system integrity. Multiple safeguards have been instituted to aid in the prevention of a loss of system integrity.
One such safeguard involves each node of the network cluster periodically informing each other node that it is operating and is still a member of the network cluster. This is typically accomplished by sending a message, generally known as a heartbeat message, from each node in the network cluster to each other node in the cluster.
If a first node fails to receive a message from a second node it regards as a member of the cluster, the first node signals the remaining nodes that cluster membership should be re-evaluated. The failure of the first node to receive the message may be indicative of the second node losing network connectivity with the first node. In an effort to resolve the network interconnectivity of the nodes, all user applications being executed on the network cluster are stalled and multiple messages are transferred among the member nodes. As a result of this messaging activity, the network interconnectivity of each member node is assessed.
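The detection step described above may be sketched as follows. This is a minimal illustration only, not the messaging protocol of any particular cluster product. The class name `HeartbeatMonitor`, the `timeout` parameter, and the use of caller-supplied timestamps are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks when each cluster member was last heard from (hypothetical helper)."""

    timeout: float  # assumed: seconds of silence tolerated before a member is suspect
    last_seen: dict = field(default_factory=dict)

    def record(self, node_id: str, now: float) -> None:
        """Note that a heartbeat message from node_id arrived at time `now`."""
        self.last_seen[node_id] = now

    def overdue(self, now: float) -> list:
        """Return the members whose most recent heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node_2", now=0.0)
monitor.record("node_3", now=2.0)

# At time 4.0 only node_2 has been silent for longer than the timeout, so the
# monitoring node would signal that cluster membership should be re-evaluated.
suspects = monitor.overdue(now=4.0)
```

In this sketch a non-empty result from `overdue` corresponds to the condition under which the first node signals the remaining nodes.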
The safeguard described above provides only a limited solution to the problem described. For instance, providing a heartbeat message from each node in the cluster to each other member node in the cluster creates considerable messaging traffic. In an N-node cluster, (N−1)² messages are sent at each specified time interval, thus consuming valuable system resources.
The present system includes a method and an apparatus for operating a network cluster in a closed loop from node_1 to node_N. Each node sends a single heartbeat message to the node ahead of it in the loop, i.e., node_1 sends a heartbeat message to node_2, node_2 to node_3, etc. Here, each node sends and receives a single message. If a node fails to receive a heartbeat message from its predecessor in the loop, it initiates a cluster reconfiguration by sending a reconfiguration message to each other node in the cluster. By operating the cluster in a closed loop, heartbeat message traffic is greatly reduced, thus freeing valuable system resources.
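The closed loop arrangement may be illustrated by the following sketch, which simulates a single heartbeat interval. It is offered under stated assumptions rather than as the claimed implementation: the function names are hypothetical, nodes are numbered 1 through N, node N is assumed to send to node 1 to close the loop, and a failed node is modeled simply as one that sends nothing.

```python
def successor(i: int, n: int) -> int:
    """Node to which node i sends its heartbeat (assumed: node n wraps to node 1)."""
    return i % n + 1


def predecessor(i: int, n: int) -> int:
    """Node from which node i expects its heartbeat."""
    return (i - 2) % n + 1


def heartbeat_round(n: int, failed=frozenset()):
    """Simulate one interval; return (heartbeats, reconfiguration messages).

    Each message is a (sender, receiver) pair. A failed node sends nothing.
    """
    heartbeats = [(i, successor(i, n)) for i in range(1, n + 1) if i not in failed]
    heard = {receiver for _, receiver in heartbeats}

    reconfig = []
    for node in range(1, n + 1):
        if node in failed or node in heard:
            continue
        # This node missed the heartbeat from its predecessor, so it sends a
        # reconfiguration message to each other node in the cluster.
        reconfig.extend((node, other) for other in range(1, n + 1) if other != node)
    return heartbeats, reconfig


# Healthy five-node loop: one heartbeat per node and no reconfiguration traffic.
healthy_beats, healthy_reconfig = heartbeat_round(5)

# With node 3 failed, node 4 misses its heartbeat and initiates reconfiguration.
degraded_beats, degraded_reconfig = heartbeat_round(5, failed={3})
```

In the healthy case the number of heartbeat messages per interval equals the number of nodes, so heartbeat traffic grows linearly with cluster size.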