Modern computing systems, particularly those employed by larger organizations and enterprises, continue to increase in size and complexity. In areas such as Internet applications, there is an expectation that millions of users should be able to access an application simultaneously, which effectively leads to an exponential increase in the amount of content generated and consumed by users, and in the number of transactions involving that content. Such activity also results in a corresponding increase in the number of transaction calls to databases and metadata stores, which have a limited capacity to accommodate that demand.
To meet these requirements, a distributed data management and cache service can be run in the application tier, in-process with the application itself, e.g., as part of an application server cluster. However, from time to time, one or more of the server machines in the application server cluster can be shut down, and/or the processes running on those server machines can become dysfunctional. There is a need to detect such an event quickly when it happens. This is the general area that embodiments of the invention are intended to address.