1. Field of the Invention
The present invention relates to distributed data systems and, in particular, to providing high availability during performance of a cluster topology self-healing process within a distributed data system cluster.
2. Description of Related Art
Cooperating members, or nodes, of a distributed data system may form a cluster to provide transparent data access and data locality for clients, abstracting the possible complexity of the data distribution within the cluster away from the clients. Nodes may be servers, computers, or other computing devices. Nodes may also be computing processes, and thus multiple nodes may exist on the same server, computer, or other computing device.
A cluster may provide high availability by maintaining multiple copies of the same data and distributing these copies among the nodes included in the cluster. Upon failure of a node in the cluster and loss of at least one copy of a portion of the cluster's data, the cluster may repair the failure through a “self-healing” process. The repair typically involves replicating the data that was stored on the failed node from a non-failed node that also stores that data onto another cluster node. Thus, the cluster topology self-healing process ensures that a desired number of copies of the data remain in the cluster. For example, two cluster nodes may store duplicates of the same data. In response to the failure of one of these two nodes, the non-failed node may duplicate the data onto a third node in order to ensure that multiple copies of the data remain in the cluster and/or to maintain high availability.
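By way of illustration only, the replication step described above may be sketched as follows. The sketch is a simplified in-memory model and not an implementation of any particular cluster. The function name `self_heal`, the `placement` mapping, the `desired_copies` parameter, and the least-loaded target selection are all assumptions introduced for this example.

```python
def self_heal(placement, failed_node, nodes, desired_copies=2):
    """Restore the desired number of copies after one node fails.

    placement: dict mapping a data key to the set of node names holding a copy
               (hypothetical structure, assumed for this sketch).
    Returns a list of (key, source_node, target_node) replication steps.
    """
    live = [n for n in nodes if n != failed_node]
    repairs = []
    for key in sorted(placement):
        holders = placement[key]
        holders.discard(failed_node)  # the copy on the failed node is lost
        # Replicate only while at least one surviving copy exists.
        while holders and len(holders) < desired_copies:
            candidates = [n for n in live if n not in holders]
            if not candidates:
                break  # not enough live nodes to hold another copy
            source = min(holders)  # any surviving holder can act as the source
            # Assumed policy: pick the live node currently holding the fewest keys.
            target = min(
                candidates,
                key=lambda n: (sum(n in h for h in placement.values()), n),
            )
            holders.add(target)
            repairs.append((key, source, target))
    return repairs
```

In this model, if nodes A and B both hold a data item and A fails, the function records a copy from B to a third node C, matching the two-node example given above.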
During self-healing, a cluster may not be able to provide continuous service to clients. The data duplication may consume significant time and bandwidth, as well as other computing resources, which are then unavailable to respond to client requests. Moreover, allowing access to the data being duplicated may result in synchronization problems that affect data accuracy and consistency. One solution to such problems is to disallow client connections to the nodes involved in the cluster topology self-healing process. Alternatively, a cluster may attempt to provide continuous service to clients during self-healing, although performance may be limited. For example, while access to the nodes involved in the cluster topology self-healing process may be permitted, access to the data being replicated may not be permitted. In such a situation, client requests targeting the data that is being replicated may not be serviced, or the client requests may be delayed until the cluster topology self-healing process is completed. Thus, each of these solutions negatively impacts the quality of service provided to the client. Accordingly, it is desirable to be able to provide higher quality of service to clients during performance of a cluster topology self-healing process.
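The second related-art approach, in which requests targeting the data being replicated are delayed until self-healing completes, may be modeled with the following minimal sketch. The class `HealingCluster` and its methods are hypothetical names chosen for this example; the sketch only illustrates why such requests go unserved during the repair.

```python
class HealingCluster:
    """Toy model: reads of data under replication are deferred (assumed design)."""

    def __init__(self, data):
        self.data = dict(data)
        self.healing_keys = set()  # keys currently being replicated
        self.delayed = []          # requests held until healing completes

    def begin_heal(self, keys):
        self.healing_keys |= set(keys)

    def read(self, key):
        if key in self.healing_keys:
            # The request cannot be serviced now; it is queued instead.
            self.delayed.append(key)
            return None
        return self.data[key]

    def end_heal(self):
        # Healing is complete: release the keys and serve the delayed requests.
        self.healing_keys.clear()
        served = [(k, self.data[k]) for k in self.delayed]
        self.delayed.clear()
        return served
```

A client reading a key that is not under replication is answered immediately, while a client reading a key under replication receives nothing until `end_heal` is called, which is the reduction in quality of service noted above.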