A server cluster is a group of at least two independent servers connected by a network and managed as a single system. The clustering of servers provides a number of benefits over independent servers. One important benefit is that cluster software, which is run on each of the servers in a cluster, automatically detects application failures or the failure of another server in the cluster. Upon detection of such failures, failed applications and the like can be quickly restarted on a surviving server, with no substantial reduction in service. Indeed, clients of a Windows NT cluster believe they are connecting to a single physical system, but are actually connecting to a service which may be provided by one of several systems. To this end, clients create a TCP/IP session with a service in the cluster using a known IP address. To the cluster software, this address is a resource that belongs to the same group (i.e., a collection of resources managed as a single unit) as the application providing the service. In the event of a failure, the cluster service "moves" the entire group to another system.
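The group-based failover described above can be pictured with a minimal sketch. All names here (`Group`, `Cluster`, `fail_node`, `locate`, the node names and the address) are hypothetical and chosen only for illustration; they are not taken from any actual cluster software, and the choice of the first surviving node is an arbitrary simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Group:
    """A collection of resources managed (and moved) as a single unit."""
    name: str
    resources: List[str]   # e.g. a known IP address plus the application behind it
    owner: str             # name of the node currently hosting the group


@dataclass
class Cluster:
    nodes: Dict[str, bool] = field(default_factory=dict)   # node name -> is alive
    groups: List[Group] = field(default_factory=list)

    def fail_node(self, failed: str) -> None:
        """Mark a node as failed and move every group it owned to a survivor."""
        self.nodes[failed] = False
        survivors = [n for n, alive in self.nodes.items() if alive]
        if not survivors:
            raise RuntimeError("no surviving node to host the groups")
        for group in self.groups:
            if group.owner == failed:
                # The whole group moves together, so the IP address and the
                # application stay co-located on the new owner.
                group.owner = survivors[0]

    def locate(self, resource: str) -> str:
        """Return the node currently hosting a resource (what a client reaches)."""
        for group in self.groups:
            if resource in group.resources:
                return group.owner
        raise KeyError(resource)


cluster = Cluster(
    nodes={"node1": True, "node2": True},
    groups=[Group("db-group", ["10.0.0.5", "db-app"], owner="node1")],
)
cluster.fail_node("node1")
print(cluster.locate("10.0.0.5"))   # the address now resolves to the surviving node
```

Because the client only ever uses the known IP address, it does not need to know which node currently owns the group.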
Other benefits include the ability for administrators to inspect the status of cluster resources, and accordingly balance workloads among different servers in the cluster to improve performance. Dynamic load balancing is also available. Such manageability also provides administrators with the ability to update one server in a cluster without taking important data and applications offline. As can be appreciated, server clusters are used in critical database management, file and intranet data sharing, messaging, general business applications and the like.
While clustering is thus desirable in many situations, problems arise if the servers (nodes) of the cluster become inconsistent with one another with respect to certain persistent cluster information. For example, memory state information, properties of the cluster or its resources and/or the state and existence of components in the cluster need to be consistent among the cluster's nodes. When such information is modified, the modifications occur on a local node and are often multiple in nature. For example, an update of a registry, an update of data on a disk and an update of the state of a resource may take place at essentially the same time as a result of a single modification of a resource or set of resources. Accordingly, for the cluster to remain consistent, each time such local data is modified, some mechanism has to provide appropriate modification information to the other nodes so that they can update their local databases.
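As a rough illustration of this requirement, the sketch below treats one logical modification as a list of associated updates that is applied on the local node and then forwarded to the other nodes. The `Node` class, the `(key, value)` update representation and the key names are assumptions made for this example only. The sketch assumes that every node is reachable and processes every update.

```python
from typing import Dict, List, Tuple

Update = Tuple[str, str]   # (key, new value); hypothetical representation


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.database: Dict[str, str] = {}   # this node's local copy of cluster data

    def apply(self, updates: List[Update]) -> None:
        for key, value in updates:
            self.database[key] = value


def modify_and_propagate(origin: Node, peers: List[Node], updates: List[Update]) -> None:
    """Apply a multi-part modification locally, then forward it to every peer."""
    origin.apply(updates)
    for peer in peers:
        peer.apply(updates)


def is_consistent(nodes: List[Node]) -> bool:
    """True when every node holds the same local database."""
    return all(n.database == nodes[0].database for n in nodes)


# One modification of a resource produces several associated updates.
modification = [
    ("registry/resource1/param", "new-value"),
    ("disk/resource1/data", "rewritten"),
    ("state/resource1", "online"),
]
a, b, c = Node("a"), Node("b"), Node("c")
modify_and_propagate(a, [b, c], modification)
print(is_consistent([a, b, c]))
```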
However, different nodes can fail at different times, potentially making some nodes inconsistent. For example, only part of a multiple update may be processed by a node before that node fails. Moreover, if one of the nodes does not receive or will not accept the change, then that node will be inconsistent with the rest of the cluster. As a result, simply broadcasting individual change information to each of the nodes and hoping that each node receives and processes the change is inadequate. In sum, there has heretofore been no adequate way in which to consistently replicate multiple associated modifications to the nodes of a server cluster.
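The failure mode described above can be reproduced with a small simulation. It is only a sketch under stated assumptions: the `Node` class, its `crash_after` parameter and the `naive_broadcast` function are hypothetical devices for modelling a node that fails partway through a multiple update, not part of any real cluster implementation.

```python
from typing import Dict, List, Optional, Tuple

Update = Tuple[str, str]   # (key, new value); hypothetical representation


class Node:
    def __init__(self, name: str, crash_after: Optional[int] = None) -> None:
        self.name = name
        self.database: Dict[str, str] = {}
        self.crash_after = crash_after   # fail once this many updates are processed
        self.processed = 0
        self.alive = True

    def receive(self, key: str, value: str) -> None:
        # Simulate a failure that strikes in the middle of a multiple update.
        if self.alive and self.crash_after is not None and self.processed >= self.crash_after:
            self.alive = False
        if not self.alive:
            return   # the change is silently lost on this node
        self.database[key] = value
        self.processed += 1


def naive_broadcast(nodes: List[Node], updates: List[Update]) -> None:
    """Send each individual change to each node and hope it is processed."""
    for key, value in updates:
        for node in nodes:
            node.receive(key, value)


modification = [
    ("registry/resource1/param", "new-value"),
    ("disk/resource1/data", "rewritten"),
    ("state/resource1", "online"),
]
healthy = Node("healthy")
flaky = Node("flaky", crash_after=1)   # processes one update, then fails
naive_broadcast([healthy, flaky], modification)
print(len(healthy.database), len(flaky.database))
```

The failed node retains only the first of the three associated updates, so its local database reflects neither the old state nor the new one.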