Implementing computing systems that manage large quantities of data and/or service large numbers of users often presents problems of scale. As demand for various types of computing services grows, it may become difficult to service that demand without increasing the available computing resources accordingly. To facilitate scaling in order to meet demand, many computing-related services are implemented as distributed applications, each application being executed on a number of computer hardware servers. For example, a number of different software processes executing on different computer systems may operate cooperatively to implement the computing service. When more service capacity is needed, additional hardware or software resources may be deployed.
However, implementing distributed applications may present its own set of challenges. For example, in a geographically distributed system, it is possible that different segments of the system might become communicatively isolated from one another, e.g., due to a failure of network communications between sites. As a consequence, the isolated segments may not be able to coordinate with one another. If care is not taken in such circumstances, inconsistent system behavior might result (e.g., if the isolated segments both attempt to modify data to which access is normally coordinated using some type of concurrency control mechanism). The larger the distributed system, the more difficult it may be to coordinate the actions of various actors within the system (e.g., owing to the difficulty of ensuring that many different actors that are potentially widely distributed have a consistent view of system state). For some distributed applications, a state management mechanism that is itself distributed may be set up to facilitate such coordination. Such a state management mechanism, which may be referred to as a distributed state manager (DSM), may comprise a number of physically distributed servers. The managed distributed application may submit requests for state transitions to the DSM, and in some implementations decisions as to whether to commit or reject the submitted transitions may be made by a group of servers of the DSM referred to as a “jury”. Representations of committed state transitions may be replicated at multiple nodes of the DSM in some implementations, e.g., to increase the availability and/or durability of state information of the managed applications.
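By way of illustration only, the commit-or-reject flow described above may be sketched as follows. The sketch assumes a simple majority rule among jurors and a version check as the concurrency control mechanism; neither is specified above, and all names (`Transition`, `Server`, `submit`) are hypothetical rather than part of any particular DSM.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Transition:
    """A requested state change (hypothetical shape, for illustration)."""
    key: str
    expected_version: int  # version the client believes is current
    new_value: str


@dataclass
class Server:
    """One DSM server holding a replica of the managed state."""
    name: str
    state: Dict[str, Tuple[int, Optional[str]]] = field(default_factory=dict)
    log: List[Transition] = field(default_factory=list)

    def vote(self, t: Transition) -> bool:
        # Assumption: a juror approves only if the client's expected
        # version matches the version this juror currently holds.
        version, _ = self.state.get(t.key, (0, None))
        return version == t.expected_version

    def apply(self, t: Transition) -> None:
        # Record the committed transition and update the local replica.
        self.state[t.key] = (t.expected_version + 1, t.new_value)
        self.log.append(t)


def submit(t: Transition, jury: List[Server], replicas: List[Server]) -> bool:
    """Commit t if a strict majority of the jury approves; else reject."""
    approvals = sum(1 for juror in jury if juror.vote(t))
    if approvals * 2 <= len(jury):
        return False  # rejected: no majority
    for server in replicas:
        server.apply(t)  # replicate the committed transition
    return True


servers = [Server("s1"), Server("s2"), Server("s3")]
first = submit(Transition("lock", 0, "owner-A"), servers, servers)
# A second writer with a stale view of the version is rejected.
second = submit(Transition("lock", 0, "owner-B"), servers, servers)
```

In this sketch the second request is rejected because every juror already holds version 1 of the key, which is one way two uncoordinated writers might be prevented from both modifying the same data.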
Of course, as in any distributed system, the servers of a DSM may themselves fail under various conditions. In an environment in which communication latencies between DSM servers may vary substantially (e.g., depending on the nature of the connectivity between the servers), determining whether the DSM itself is in a healthy state (e.g., with a sufficient number of jurors to make state transition decisions) may not be straightforward. In at least some DSM implementations, jury members may be selected dynamically in an automated and distributed fashion by the DSM servers themselves. Each server involved in the jury selection process may act on the basis of potentially out-of-date information, and each proposed change to the jury may require approval by the current jury before the change is committed. In such environments, selecting and implementing jury membership changes to improve the overall availability and failure resilience of the DSM may be a non-trivial exercise.
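By way of illustration only, the health determination and the jury-change approval described above may be sketched as follows. The sketch assumes that a "sufficient number of jurors" means a strict majority of the current jury, which is an assumption not stated above; the function names are hypothetical.

```python
from typing import FrozenSet, Set


def quorum_size(jury_size: int) -> int:
    """Smallest strict majority of a jury (assumed quorum rule)."""
    return jury_size // 2 + 1


def is_healthy(jury: FrozenSet[str], reachable: Set[str]) -> bool:
    """True if enough jurors are reachable to decide state transitions."""
    return len(jury & reachable) >= quorum_size(len(jury))


def propose_jury_change(
    current_jury: FrozenSet[str],
    proposed_jury: FrozenSet[str],
    approvals: Set[str],
) -> FrozenSet[str]:
    """Return the jury in effect after the proposal is considered.

    Only approvals from members of the current jury are counted; the
    change is committed only if those approvals reach a quorum.
    """
    votes = len(current_jury & approvals)
    if votes >= quorum_size(len(current_jury)):
        return proposed_jury
    return current_jury


jury = frozenset({"a", "b", "c"})
replacement = frozenset({"a", "b", "d"})  # swap failed juror "c" for "d"

committed = propose_jury_change(jury, replacement, approvals={"a", "b"})
# The approval from "d" does not count: "d" is not yet a juror.
rejected = propose_jury_change(jury, replacement, approvals={"a", "d"})
```

Because the proposing server may hold out-of-date information about which jurors are reachable, a proposal of this kind might be rejected or superseded in practice; the sketch does not model that staleness.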
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.