Conventionally, nodes in a cluster use a heartbeat mechanism to detect the failure of a peer node. In particular, the nodes exchange heartbeat messages with a pre-configured retry count and at pre-configured retry intervals. The customer therefore bears the burden of configuring appropriate values for the retry count and retry intervals of the heartbeat messages.
Moreover, configuring the retry count and retry intervals for heartbeat messages between any two nodes can be difficult. If a node is pre-configured with a high retry count and/or low retry intervals, the node can detect a peer node's failure quickly. However, exchanging heartbeat messages this frequently causes the node to transmit an excessive number of messages and may also lead to false positive results. A false positive result occurs when a peer node that is merely delayed by temporary network congestion is deemed a failed node because the node does not receive a reply heartbeat message from the peer node within the short pre-configured interval.
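The false positive case described above can be illustrated with a minimal sketch. The function name `declared_failed`, the sample delay values, and the rule that a peer is deemed failed after `retry_count` consecutive heartbeats go unanswered within `retry_interval` are assumptions made for illustration only; they are not taken from any particular conventional implementation.

```python
from typing import Optional, Sequence


def declared_failed(reply_delays: Sequence[Optional[float]],
                    retry_interval: float,
                    retry_count: int) -> bool:
    """Return True if a fixed-interval detector would deem the peer failed.

    reply_delays holds, per heartbeat, the time in seconds until the reply
    arrived, or None if no reply arrived at all. Assumed rule (hypothetical):
    the peer is deemed failed once retry_count consecutive heartbeats receive
    no reply within retry_interval.
    """
    consecutive_misses = 0
    for delay in reply_delays:
        if delay is None or delay > retry_interval:
            consecutive_misses += 1
            if consecutive_misses >= retry_count:
                return True
        else:
            # A timely reply resets the miss counter.
            consecutive_misses = 0
    return False


# A healthy peer whose replies are slowed by a brief congestion burst.
congested_but_alive = [0.05, 0.06, 0.40, 0.45, 0.42, 0.05]

# Short interval: the delayed replies count as misses (false positive).
print(declared_failed(congested_but_alive, retry_interval=0.1, retry_count=3))  # True

# Longer interval: the same peer is correctly treated as alive.
print(declared_failed(congested_but_alive, retry_interval=0.5, retry_count=3))  # False
```

In this sketch the peer never actually fails; only the choice of interval determines whether it is deemed failed, which is why a single statically configured value is hard to get right.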
Also, the number of configuration parameters increases linearly with the addition of every node in the cluster. Thus, customers opt either to configure the retry count and intervals for every pair of nodes in a cluster independently, which can become quite tedious, or to configure a global retry count and interval, which is sub-optimal because nodes in a cluster carry different loads and have different delay characteristics.
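The per-pair configuration burden can be made concrete with a short sketch. It assumes, hypothetically, a full mesh in which every unordered pair of nodes carries its own retry count and retry interval; the node names and the helper `per_pair_parameter_count` are illustrative only.

```python
from itertools import combinations

PARAMS_PER_PAIR = 2  # assumed: one retry count and one retry interval per pair


def per_pair_parameter_count(nodes):
    """Count the parameters needed when every node pair is configured independently."""
    pair_count = sum(1 for _ in combinations(nodes, 2))
    return PARAMS_PER_PAIR * pair_count


nodes = ["n1", "n2", "n3"]
before = per_pair_parameter_count(nodes)           # 3 pairs
after = per_pair_parameter_count(nodes + ["n4"])   # 6 pairs

# Adding a fourth node introduces one new pair per existing node.
print(after - before)  # 6
```

Under this assumption, each added node requires new parameters in proportion to the number of nodes already in the cluster, whereas a global setting keeps the count at two but cannot reflect per-node load or delay differences.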