In a high-availability configuration, a computer cluster may include multiple nodes configured to provide high availability of at least one application. Such a configuration may be implemented in an effort to ensure that at least one node within the computer cluster is capable of executing the application at any given time. For example, a cluster engine may select a particular node within the computer cluster to execute the application but later determine (whether accurately or in error) that the node is not healthy enough to continue executing the application. In this example, upon determining that the node is not healthy enough to continue executing the application, the cluster engine may direct the application to fail over from the apparently unhealthy node to another node within the computer cluster in an attempt to reduce application downtime.
Unfortunately, while existing cluster technologies may facilitate application failover from one node to another node, such cluster technologies may also suffer from one or more shortcomings and/or inefficiencies that potentially lead to application downtime. For example, existing cluster technologies may implement a heartbeat mechanism in which the nodes within the computer cluster send periodic heartbeats (via a high-availability agent) to the cluster engine to indicate that such nodes are healthy. In this example, if the node executing the application experiences a system failure shortly after sending a heartbeat, the cluster engine may remain unaware of the node's system failure until the next heartbeat from the node is due. As a result, the cluster engine may be unable to direct the application to fail over to another node within the computer cluster until the cluster engine determines that the node executing the application failed to send a timely heartbeat, potentially leading to application downtime in the interim.
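By way of illustration only, the detection delay described above may be sketched as follows. The class `HeartbeatMonitor`, the function `detection_delay`, the 10-second interval, and the node name are hypothetical and are not drawn from any particular cluster product; the sketch simply assumes that the cluster engine presumes a node healthy until the next heartbeat is overdue.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical cluster-engine view of node heartbeats (illustrative only)."""

    interval: float  # assumed number of seconds between expected heartbeats
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember when the most recent heartbeat from this node arrived.
        self.last_seen[node] = now

    def is_presumed_healthy(self, node: str, now: float) -> bool:
        # The engine has no other signal: the node is presumed healthy
        # until its next heartbeat is overdue.
        return now - self.last_seen[node] <= self.interval


def detection_delay(monitor: HeartbeatMonitor, node: str,
                    failure_time: float, step: float = 0.5) -> float:
    """Return how long after failure_time the engine first flags the node."""
    now = failure_time
    while monitor.is_presumed_healthy(node, now):
        now += step  # simulated clock; no heartbeats arrive after the failure
    return now - failure_time
```

Under these assumptions, a node that sends a heartbeat at time 100 and fails at time 101 is not flagged until roughly one full interval has elapsed, during which the application may be unavailable.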
In some examples, existing cluster technologies may also determine in error that the node executing the application is unhealthy. For example, even though the node executing the application is otherwise healthy, the node's CPU may become so overloaded that the node is unable to send a timely heartbeat to the cluster engine. In this example, as a result of the node's inability to send a timely heartbeat, the cluster engine may determine in error that the node is not healthy enough to continue executing the application. Upon making this erroneous determination, the cluster engine may direct the application to fail over from the apparently unhealthy node to another node in an unnecessary attempt to reduce application downtime.
In another example, the cluster engine may queue a heartbeat received from the node executing the application. However, in this example, the cluster engine may become so overloaded that the cluster engine is unable to process the queued heartbeat within a predetermined time interval. As a result of the cluster engine's inability to process the heartbeat within the predetermined time interval, the cluster engine may determine in error that the node is not healthy enough to continue executing the application. Upon making this erroneous determination, the cluster engine may direct the application to fail over from the apparently unhealthy node to another node in an unnecessary attempt to reduce application downtime.
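The queued-heartbeat scenario may likewise be sketched, purely as an illustration. The class `ClusterEngine`, its method names, and the timeout value used below are hypothetical; the sketch assumes that the engine judges node health only from heartbeats it has already processed, so a heartbeat that is still waiting in the queue does not count.

```python
from collections import deque


class ClusterEngine:
    """Hypothetical engine that judges health only from processed heartbeats."""

    def __init__(self, timeout: float):
        self.timeout = timeout    # assumed predetermined time interval
        self.queue = deque()      # heartbeats received but not yet processed
        self.last_processed = {}  # node -> time a heartbeat was last processed

    def receive(self, node: str, sent_at: float) -> None:
        # Receiving only enqueues the heartbeat; it does not update health state.
        self.queue.append((node, sent_at))

    def process_queue(self, now: float) -> None:
        # An overloaded engine may call this later than intended.
        while self.queue:
            node, _sent_at = self.queue.popleft()
            self.last_processed[node] = now

    def considers_healthy(self, node: str, now: float) -> bool:
        last = self.last_processed.get(node)
        return last is not None and now - last <= self.timeout
```

For instance, with an assumed timeout of 10, a heartbeat processed at time 0 and a second heartbeat received at time 8 but left in the queue, `considers_healthy` returns `False` at time 12 even though the node sent its heartbeat on schedule. Once the queue is drained, the same node is considered healthy again, which shows that the earlier determination was made in error.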
As such, the instant disclosure identifies a need for systems and methods for efficiently and accurately determining the healthiness of nodes within computer clusters to avoid (or at least further reduce) application downtime.