Distributed computing systems have found application in a number of different computing environments, particularly those requiring high performance and/or high availability and fault tolerance. In a distributed computing system, multiple computers connected by a network are permitted to communicate and/or share workload. Distributed computing systems support practically all types of computing models, including peer-to-peer and client-server computing.
One particular type of distributed computing system is referred to as a clustered computing system. “Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a client or user, the nodes in a cluster appear collectively as a single computer, or entity. In a client-server computing model, for example, the nodes of a cluster collectively appear as a single server to any clients that attempt to access the cluster.
Clustering is often used in relatively large multi-user computing systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes, which prevents individual nodes from becoming overloaded and thereby maximizes overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
An important feature often required in clusters as well as other distributed computing systems is the distribution of state data among the various computing devices participating in the system. In a clustered computing system, for example, each node or member typically has locally-generated data that contributes to the overall “state” of the cluster. Much of this state data is specific to the member, but can be utilized by other members (which may include other nodes of the cluster and/or any clients that attempt to access the cluster).
The member-specific state data may relate to practically any information that contributes to the overall state of the distributed computing system. For example, the state data may be performance-related, e.g., in terms of current processing throughput, load level, or communications throughput. The state data may also be configuration- or capability-related, e.g., in terms of available services, available connections, endpoint configuration data, or supported protocols.
In some instances, member-specific state data may be relatively static, and may not change significantly over time. Other member-specific state data may be relatively dynamic, and may only be interesting or useful as long as the member is active in the system. For example, performance-related member-specific state data may be used by a load balancing algorithm to determine the relative workloads of multiple nodes in a cluster, and thus enable tasks to be routed to individual nodes to efficiently balance the overall workload among the nodes.
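By way of illustration only, the use of performance-related state data by a load balancing algorithm might be sketched as follows. The sketch assumes a simple least-loaded selection policy; the `MemberState` record, its `load_level` field, and the `pick_least_loaded` function are hypothetical names introduced for this example and do not describe any particular system.

```python
from dataclasses import dataclass


@dataclass
class MemberState:
    """Hypothetical record of state data reported by one member."""
    name: str
    load_level: float  # assumed metric: fraction of capacity in use, 0.0 to 1.0
    active: bool = True


def pick_least_loaded(states):
    """Return the name of the active member reporting the lowest load."""
    # State data from inactive members is ignored, since it is only
    # useful while the member is active in the system.
    candidates = [s for s in states if s.active]
    if not candidates:
        raise RuntimeError("no active members available")
    return min(candidates, key=lambda s: s.load_level).name


states = [
    MemberState("node-a", 0.82),
    MemberState("node-b", 0.35),
    MemberState("node-c", 0.10, active=False),  # failed member, skipped
]
print(pick_least_loaded(states))  # prints "node-b"
```

A least-loaded policy is only one possible choice; a weighted or round-robin policy could consume the same state data.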
Timely and reliable distribution of member-specific state data is often required to ensure reliable operation of a distributed computing system. Traditionally, such distribution has been handled by collecting the member-specific state data in a central location, e.g., in a single computing device in a distributed computing system. Individual members report their respective member-specific state data to the single computing device, and members that desire to receive such state data are permitted to “subscribe” to receive it. Changes to the state data reported by a particular member are then automatically distributed to any members that are subscribed to receive such changes. The single computing device that distributes the state data is also required to monitor all of the other members, and to update any state data and subscriptions if any of the members fail and become inaccessible.
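By way of illustration only, the traditional centralized arrangement described above might be sketched as follows. The sketch runs in a single process and assumes callback-based delivery in place of network communication; the `CentralStateRegistry` class and its `subscribe`, `report`, and `member_failed` methods are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Optional

# A subscriber callback receives the reporting member's name and its
# latest state data, or None if that member has failed.
Callback = Callable[[str, Optional[dict]], None]


class CentralStateRegistry:
    """Hypothetical single device that collects and distributes state data."""

    def __init__(self) -> None:
        self._state: Dict[str, dict] = {}
        self._subscribers: Dict[str, Callback] = {}

    def subscribe(self, subscriber: str, callback: Callback) -> None:
        """Register a member to receive changes to reported state data."""
        self._subscribers[subscriber] = callback

    def report(self, member: str, state: dict) -> None:
        """Store a member's state data and distribute the change."""
        self._state[member] = dict(state)
        self._notify(member, dict(state))

    def member_failed(self, member: str) -> None:
        """Drop a failed member's state data and subscription."""
        self._state.pop(member, None)
        self._subscribers.pop(member, None)
        self._notify(member, None)

    def _notify(self, member: str, state: Optional[dict]) -> None:
        for callback in list(self._subscribers.values()):
            callback(member, state)


registry = CentralStateRegistry()
seen = []
registry.subscribe("node-b", lambda member, state: seen.append((member, state)))
registry.report("node-a", {"load_level": 0.4})
registry.member_failed("node-a")
```

In this sketch every report and every failure notification passes through the one registry object, so all subscribers depend on that object remaining available.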
The use of a single computing device, however, represents a single point of failure, which is often undesirable, particularly in high availability environments. In addition, a single computing device may present a scaling problem as the amount of shared state data increases.
As such, a significant need has existed for a more reliable and scalable manner of managing and distributing member-specific state data among multiple members in a distributed computing system.