Implementing computing systems that manage large quantities of data and/or service large numbers of users often presents problems of scale. As demand for various types of computing services grows, it may become difficult to service that demand without increasing the available computing resources accordingly. To facilitate scaling in order to meet demand, many computing-related services are implemented as distributed applications executed on a number of computer hardware servers. For example, a number of different software processes or nodes executing on different computer systems may operate cooperatively to implement the computing service. When more service capacity is needed, additional hardware or software resources may be deployed.
For some types of distributed services implemented at multiple nodes, one or more of the nodes may serve as a leader or coordinator of the service. For example, a leader node may be responsible for receiving service requests from clients, and orchestrating the execution of the work required to fulfill each service request by farming out respective tasks to the other (non-leader) nodes of the service. The role of the leader may be dynamically assignable in some services, e.g., so that, in the event of a failure of the current leader, a new leader can be selected and the service can continue to process client requests. In some cases, a role manager separate from the distributed service may be responsible for selecting the leader node, e.g., in response to leadership assignment requests from the service nodes. In order to support high availability for the distributed service, the role manager itself may be designed to be fault-tolerant. For example, the role manager may comprise a cluster of network-connected role manager nodes which use a quorum-based protocol or algorithm for selecting the leader.
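By way of a non-limiting illustration, a quorum-based leader selection of the kind mentioned above might be sketched as follows. The class `RoleManagerNode`, the function `request_leadership`, the node names, and the simple-majority rule are all assumptions made for this sketch and do not appear in the description; an actual role manager would typically rely on a complete consensus protocol rather than this simplified vote count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoleManagerNode:
    """Hypothetical role manager node; `reachable` models network connectivity."""
    name: str
    reachable: bool = True
    leader: Optional[str] = None  # leader this node has recorded, if any

    def can_accept(self, candidate: str) -> bool:
        # An unreachable node cannot take part in the vote.
        if not self.reachable:
            return False
        # Accept only if no different leader has already been recorded.
        return self.leader is None or self.leader == candidate


def request_leadership(cluster: List[RoleManagerNode], candidate: str) -> bool:
    """Grant leadership to `candidate` only if a majority of the cluster accepts."""
    quorum = len(cluster) // 2 + 1  # simple majority (an assumption of this sketch)
    voters = [node for node in cluster if node.can_accept(candidate)]
    if len(voters) < quorum:
        return False  # no quorum: leave every node's state unchanged
    for node in voters:
        node.leader = candidate
    return True


if __name__ == "__main__":
    cluster = [RoleManagerNode(f"rm-{i}") for i in range(3)]
    # The first service node to ask obtains the leader role.
    print(request_leadership(cluster, "svc-node-A"))
    # A second request is refused while the first leader is still recorded.
    print(request_leadership(cluster, "svc-node-B"))
```

In this sketch, a request made while most role manager nodes are unreachable fails without changing any node's recorded leader, which mirrors the reason a quorum is used: two isolated segments of the cluster cannot each select a different leader.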
In distributed systems such as multi-node services or multi-node role managers, different segments of the system might become communicatively isolated from one another, e.g., due to a failure of network communications between nodes, or due to failures of some of the nodes themselves. If the current leader node of the service fails or becomes disconnected, but the role manager remains accessible, a new leader node may be selected fairly quickly with minimal impact on the service. However, in some large-scale failure scenarios, the role manager may also fail or become disconnected (e.g., in addition to the leader node of the service), which may potentially lead to a more serious impact on the service. Furthermore, the various nodes of the service and/or the role manager may come back online in unpredictable sequences after such failures. Orchestrating a clean recovery from large-scale failures remains a challenging problem for at least some types of distributed services.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.