Field of the Disclosure
Embodiments presented herein generally relate to distributed computing systems and, more specifically, to dynamically changing members of a consensus group in a distributed self-healing coordination service.
Description of the Related Art
A computing cluster is a distributed system of compute nodes that work together to provide a service that appears as a single system to nodes outside the cluster. Each node within the cluster can provide the service (or services) to clients outside of the cluster.
A cluster often uses a coordination service to maintain configuration information, perform health monitoring, and provide distributed synchronization. Such a coordination service relies on a consensus group to collectively reach consensus on values. For example, the consensus group may need to determine which process should be able to commit a transaction to a key-value store, or agree on which member of the consensus group should be elected as a leader. The consensus group includes a set of processes that run a consensus algorithm to reach consensus. Traditionally, members of the consensus group in a distributed computing system are fixed as part of the system's external configuration. Failure of a member may not completely prevent operation of the coordination service, but it does decrease the level of fault tolerance of the system. The consensus group has traditionally been unable to automatically add new members if an existing member has failed. Instead, previous solutions required an administrator to service and restore the failed member to bring the system back to a steady state.
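By way of illustration only, the loss of fault tolerance described above can be sketched numerically. The sketch assumes a majority-quorum consensus algorithm (as used by, e.g., Paxos-style or Raft-style protocols); the disclosure does not state which algorithm is used, and the function names quorum_size, can_make_progress, and additional_failures_tolerated are hypothetical names chosen for this example.

```python
def quorum_size(group_size: int) -> int:
    """Smallest strict majority of a fixed-size consensus group (assumed rule)."""
    return group_size // 2 + 1


def can_make_progress(live_members: int, group_size: int) -> bool:
    """The group can reach consensus only while a majority is still alive."""
    return live_members >= quorum_size(group_size)


def additional_failures_tolerated(live_members: int, group_size: int) -> int:
    """How many more members may fail before consensus becomes impossible."""
    return max(live_members - quorum_size(group_size), 0)


if __name__ == "__main__":
    group_size = 5  # membership fixed by external configuration
    for failed in range(4):
        live = group_size - failed
        print(
            f"failed={failed} live={live} "
            f"progress={can_make_progress(live, group_size)} "
            f"spare={additional_failures_tolerated(live, group_size)}"
        )
```

Under these assumptions, a five-member group with a fixed membership continues to operate after one failure but can then absorb only one further failure instead of two, and nothing restores that margin until the failed member is repaired.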