Implementing computing systems that manage large quantities of data and/or service large numbers of users often presents problems of scale. As demand for various types of computing services grows, it may become difficult to service that demand without increasing the available computing resources accordingly. To facilitate scaling in order to meet demand, many computing-related services are implemented as distributed applications, each application being executed on a number of computer hardware servers. For example, a number of different software processes executing on different computer systems may operate cooperatively to implement the computing service. When more service capacity is needed, additional hardware or software resources may be deployed.
However, implementing distributed applications may present its own set of challenges. For example, in a geographically distributed system, it is possible that different segments of the system might become communicatively isolated from one another, e.g., due to a failure of network communications between sites. As a consequence, the isolated segments may not be able to coordinate with one another. If care is not taken in such circumstances, inconsistent system behavior might result (e.g., if the isolated segments both attempt to modify data to which access is normally coordinated using some type of concurrency control mechanism). The larger the distributed system, the more difficult it may be to coordinate the actions of various actors within the system (e.g., owing to the difficulty of ensuring that many different actors that are potentially widely distributed have a consistent view of system state). For some distributed applications, a state management mechanism that is itself distributed may be set up to facilitate such coordination. Such a state management mechanism, which may be referred to as a distributed state manager (DSM), may comprise a number of physically distributed nodes. The managed distributed application may submit requests for state transitions to the DSM, and decisions as to whether to commit or reject the submitted transitions may be made by a group of nodes of the DSM in at least some cases. Representations of committed state transitions may be replicated at multiple nodes of the DSM in some implementations, e.g., to increase the availability and/or durability of state information of the managed applications.
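By way of illustration only, the commit-or-reject flow described above might be sketched as follows. The sketch assumes a strict-majority quorum rule, which the description does not mandate; the names `Node`, `vote`, and `submit_transition` are hypothetical and are not drawn from any particular DSM implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DSM node holding a local replica of committed transitions."""
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def vote(self, transition: str) -> bool:
        # Illustrative rule only: accept a transition unless this node
        # has already recorded it as committed.
        return transition not in self.log


def submit_transition(nodes, transition):
    """Commit `transition` only if a strict majority of all nodes accept it.

    Assumption: a simple majority quorum stands in for whatever group
    decision procedure a real DSM would use.
    """
    quorum = len(nodes) // 2 + 1
    voters = [n for n in nodes if n.reachable and n.vote(transition)]
    if len(voters) < quorum:
        return False  # rejected: too few nodes accepted the transition
    for n in voters:
        n.log.append(transition)  # replicate the committed transition
    return True


if __name__ == "__main__":
    cluster = [Node("n1"), Node("n2"), Node("n3")]
    print(submit_transition(cluster, "acquire-lock:A"))  # majority reachable
    cluster[0].reachable = False
    cluster[1].reachable = False
    print(submit_transition(cluster, "acquire-lock:B"))  # isolated minority
```

In this sketch, an isolated minority of nodes cannot commit a transition on its own, which is one way the inconsistent behavior described above for communicatively isolated segments may be avoided.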
Of course, as in any distributed system, the components of the DSM may themselves fail under various conditions. In an environment in which communication latencies between DSM nodes may vary substantially, which may be the case depending on the nature of the connectivity between the nodes, determining whether the DSM itself is in a healthy state may not be straightforward. For example, messages between DSM nodes may be delayed due to a variety of causes, such as network usage spikes, DSM node CPU utilization spikes, or failures at the nodes or in the network, and verifying that a failure has indeed occurred, rather than a transient delay, may sometimes take a substantial amount of time. In some DSM implementations, manual intervention (e.g., based on alerts directed towards support staff when failure conditions are eventually detected) or other error-prone procedures may be required to recover from failures that affect multiple nodes.
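The difficulty of distinguishing a delayed node from a failed one may be illustrated with a minimal timeout-based sketch. The function name `suspected_nodes`, the `last_heard` mapping, and the timeout values are assumptions introduced here for illustration; the description does not specify any particular detection technique.

```python
def suspected_nodes(last_heard, now, timeout):
    """Return the names of nodes whose most recent message is older than `timeout`.

    `last_heard` maps a node name to the time (in seconds) at which a message
    from that node was last received. A node is only *suspected* of failure:
    a message delayed by a network or CPU spike looks the same as a crash.
    """
    return sorted(name for name, t in last_heard.items() if now - t > timeout)


if __name__ == "__main__":
    # Hypothetical observations: n2 is merely slow, n3 has actually failed.
    last_heard = {"n1": 100.0, "n2": 97.5, "n3": 90.0}
    print(suspected_nodes(last_heard, now=101.0, timeout=2.0))
    print(suspected_nodes(last_heard, now=101.0, timeout=5.0))
```

With the shorter timeout, both the slow node and the failed node are flagged; with the longer timeout, only the failed node is flagged, but at the cost of slower detection. This trade-off is one reason verifying a failure may take time when latencies vary substantially.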
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.