This invention relates to monitoring the health of a cluster of servers that provide services to users. More specifically, this invention relates to using such health information to facilitate recovery of a user dialog with a server when the server that had been supporting the dialog fails.
Heartbeats have typically been used by a single monitoring node to determine the health of other nodes in a network. The monitoring node may periodically transmit an inquiry to each of the nodes being monitored, with the expectation of receiving a reply from each within a known time; a timely reply confirms the health of the replying node.
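By way of illustration only, one polling round of such a monitoring node might be sketched as follows. The function name `poll_once`, the `probe` callable, and the `reply_window` parameter are hypothetical names introduced for this sketch and are not taken from any particular system; the probe is assumed to return the reply delay in seconds, or `None` when no reply arrives.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical probe: returns the reply delay in seconds for a node,
# or None if the node never replied to the inquiry.
Probe = Callable[[str], Optional[float]]


def poll_once(nodes: Iterable[str], probe: Probe, reply_window: float) -> Dict[str, bool]:
    """Send one inquiry to each monitored node and record whether it is healthy.

    A node is treated as healthy only if it replied within reply_window seconds.
    """
    status: Dict[str, bool] = {}
    for node in nodes:
        delay = probe(node)
        status[node] = delay is not None and delay <= reply_window
    return status


# Simulated replies for the sketch: node-b never replies, node-c replies too late.
simulated_delays = {"node-a": 0.1, "node-b": None, "node-c": 2.0}
result = poll_once(simulated_delays.keys(), simulated_delays.get, reply_window=1.0)
```

In this sketch a late reply is treated the same as a missing reply, which is one possible reading of "within a known time"; an actual monitor would repeat the round at its chosen period.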
When the monitoring node detects the failure of a node by its missing heartbeat, the monitoring node can implement alternative actions. For example, the monitoring node may redirect future service requests directed to the failed node to another node. Such action may be sufficient where the service request represents a new initial request for service or is a stand-alone request that is independent of past history involving the failed node. However, as recognized as part of the present invention, redirecting a service request sent to a failed node to another node does not represent an effective solution where the service request is dependent on prior information stored at or exchanged with the failed node, i.e., where the prior history of communications with the failed node is required to process the current request, such as in an ongoing dialog. Thus, a need exists for a better recovery technique when a service node fails, especially where a user request is dependent on past communications with the failed node.
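The limitation described above can be illustrated with a minimal sketch, assuming each service node keeps dialog history only in its own memory. The names `ServiceNode`, `route`, and `continues_dialog` are hypothetical and are used solely for this illustration; the sketch shows the problem with simple redirection, not a recovery technique.

```python
from typing import Dict, List


class ServiceNode:
    """Hypothetical service node that keeps each dialog's history locally."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.history: Dict[str, List[str]] = {}  # dialog id -> prior messages

    def handle(self, dialog_id: str, message: str, continues_dialog: bool) -> str:
        # A request that continues a dialog needs the history held by this node.
        if continues_dialog and dialog_id not in self.history:
            return "error: no prior history for dialog " + dialog_id
        self.history.setdefault(dialog_id, []).append(message)
        return "ok: " + self.name + " processed step " + str(len(self.history[dialog_id]))


def route(primary: ServiceNode, backup: ServiceNode, dialog_id: str,
          message: str, continues_dialog: bool) -> str:
    """Redirect the request to the backup node when the primary has failed."""
    target = primary if primary.alive else backup
    return target.handle(dialog_id, message, continues_dialog)


node_a, node_b = ServiceNode("A"), ServiceNode("B")
first = route(node_a, node_b, "d1", "hello", continues_dialog=False)
node_a.alive = False  # the node supporting dialog d1 fails
redirected_followup = route(node_a, node_b, "d1", "next step", continues_dialog=True)
redirected_new = route(node_a, node_b, "d2", "hello", continues_dialog=False)
```

Here the redirected new request succeeds on the other node, while the redirected follow-up in the ongoing dialog cannot be processed because its prior history existed only on the failed node.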