1. Field of the Invention
The present invention relates to distributed data systems and, in particular, to handling temporary changes in membership within a distributed data system cluster.
2. Description of Related Art
Cooperating members, or nodes, of a distributed data system may form a cluster to provide transparent data access and data locality for clients, abstracting the possible complexity of the data distribution within the cluster away from the clients. Nodes may be servers, computers, or other computing devices. Nodes may also be computing processes, and thus multiple nodes may exist on the same server, computer, or other computing device.
A cluster may provide high availability by replicating data on one or more of the nodes included in the cluster. Upon failure of a node in the cluster, the cluster may repair the failure through a “self-healing” process to maintain high availability. The repair typically involves copying the data that was stored on the failed node from a non-failed node that also stores that data onto another cluster node. Thus, the healing process ensures that a desired number of copies of the data remain in the cluster. For example, two cluster nodes may store duplicates of the same data. In response to the failure of one of these two nodes, the non-failed node may duplicate the data onto a third node to ensure that multiple copies of the data remain in the cluster and to maintain high availability.
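By way of illustration only, the self-healing process described above may be sketched as follows. The function name `heal`, the placement map, and the `target_copies` parameter are hypothetical and chosen for this example; the rule of placing new copies on the least-loaded surviving node is likewise an assumption, not a feature of any particular cluster.

```python
from typing import Dict, Set


def heal(replicas: Dict[str, Set[str]], failed: str,
         target_copies: int = 2) -> Dict[str, Set[str]]:
    """Return a new placement map with `failed` removed and its data re-replicated.

    `replicas` maps a node name to the set of data item keys stored on it.
    All names here are illustrative assumptions.
    """
    # Drop the failed node; copy the sets so the input map is left unchanged.
    survivors = {node: set(items)
                 for node, items in replicas.items() if node != failed}

    for item in sorted(replicas.get(failed, set())):
        # Non-failed nodes that still hold a copy of this item.
        holders = [n for n in survivors if item in survivors[n]]
        if not holders:
            continue  # no surviving copy, so this item cannot be healed

        # Candidate targets: surviving nodes without the item, least loaded first.
        candidates = sorted(
            (n for n in survivors if item not in survivors[n]),
            key=lambda n: (len(survivors[n]), n),
        )
        missing = max(target_copies - len(holders), 0)
        for node in candidates[:missing]:
            survivors[node].add(item)  # duplicate from a holder onto this node

    return survivors


# Two nodes store duplicates of item "x"; node "A" fails.
cluster = {"A": {"x"}, "B": {"x"}, "C": set()}
healed = heal(cluster, "A")
print(healed)  # {'B': {'x'}, 'C': {'x'}}
```

In this sketch, the surviving holder "B" acts as the source and the third node "C" receives the new copy, so that two copies of the data remain in the cluster.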
Healing may affect the distribution of data among the cluster nodes. For example, before a node failure, the data in a cluster is often distributed proportionately among the nodes to provide load balancing. The replication algorithm used to heal the cluster may distribute the data stored on the failed node among the failed node's neighboring nodes, resulting in uneven data distribution. Further complicating matters, the failed node may return to the cluster relatively quickly (e.g., if the failure was due to transitory communication problems or a quick restart). However, once the healing process has been performed, it may be unduly complicated to return to the original even data distribution. As a result, if several temporary failures occur, the data distribution within the cluster may become so unbalanced that cluster performance suffers.
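The imbalance described above may be illustrated with a minimal sketch. The four-node ring, the rule of storing each item on a primary node and its ring successor, and the function names below are all assumptions made for this example and do not describe any particular replication algorithm.

```python
NODES = ["N0", "N1", "N2", "N3"]  # hypothetical four-node ring


def initial_placement(num_items: int) -> dict:
    """Store each item on a primary node and on that node's ring successor."""
    placement = {node: set() for node in NODES}
    for item in range(num_items):
        primary = item % len(NODES)
        backup = (primary + 1) % len(NODES)
        placement[NODES[primary]].add(item)
        placement[NODES[backup]].add(item)
    return placement


def heal_onto_neighbors(placement: dict, failed: str) -> dict:
    """Re-replicate the failed node's items onto its two ring neighbors."""
    idx = NODES.index(failed)
    neighbors = [NODES[(idx - 1) % len(NODES)], NODES[(idx + 1) % len(NODES)]]
    healed = {n: set(items) for n, items in placement.items() if n != failed}
    for item in placement[failed]:
        # Copy the item to the first neighbor that does not already hold it.
        for neighbor in neighbors:
            if item not in healed[neighbor]:
                healed[neighbor].add(item)
                break
    return healed


def load(placement: dict) -> dict:
    """Number of items stored on each node, ordered by node name."""
    return {n: len(items) for n, items in sorted(placement.items())}


before = initial_placement(8)
after = heal_onto_neighbors(before, "N1")
print(load(before))  # {'N0': 4, 'N1': 4, 'N2': 4, 'N3': 4}
print(load(after))   # {'N0': 6, 'N2': 6, 'N3': 4}
```

Under these assumptions, the neighbors of the failed node each carry six items after healing while the remaining node still carries four. If the failed node then rejoins the cluster, it holds no data until some further redistribution is performed.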