Implementing computing systems that manage large quantities of data and/or service large numbers of users often presents problems of scale. For example, as demand for various types of computing services grows, it may become difficult to service that demand without increasing the available computing resources accordingly. To facilitate scaling in order to meet demand, a particular computing service might be implemented as a distributed application that executes on a number of instances of computing hardware. For example, a number of different software processes executing on different computer systems may operate cooperatively to implement the computing service. When more service capacity is needed, additional hardware or software resources may be deployed.
However, implementing distributed applications may present its own set of challenges. For example, in a geographically distributed system, it is possible that different segments of the system might become communicatively isolated from one another, e.g., due to a failure of network communications between sites. As a consequence, the isolated segments may not be able to coordinate with one another. If care is not taken in such circumstances, inconsistent system behavior might result (e.g., if the isolated segments both attempt to modify data to which they would ordinarily coordinate access).
More generally, the larger the distributed system, the more difficult it may be to coordinate the actions of various actors within the system (e.g., owing to the difficulty of ensuring that many different actors that are potentially widely distributed have a consistent view of system state). In some environments, a group of resources may be organized as a cluster to provide infrastructure services, such as general-purpose state coordination functionality, to other distributed applications. For example, such a cluster may comprise a plurality of logical or physical compute resources, linked by an appropriate interconnect, working together to provide read and write access to a shared repository containing state information for various other applications and services of the distributed environment. In order to ensure continued provision of the state coordination service by such a cluster, the health of the cluster may need to be monitored using a reliable monitoring mechanism.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood, that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.