In a clustered computing system that comprises two or more nodes, an application may be hosted by any of the nodes in the system. That is, an application may run in the form of a corresponding application instance on any of the nodes in the system. To increase resilience and availability of the system and the applications hosted thereon, runtime states of application instances are often monitored. When a particular application instance of an application fails and the failure is detected, the system may attempt to start another application instance of the same application locally, on the same node where the failed instance was running. If that local restart does not succeed, the system may attempt to start an application instance of the same application on a different node in the system. Restarting a failed application by starting an application instance on a node other than the node where a previous application instance of the same application failed is known as failover.
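The restart-then-failover sequence described above can be sketched as follows. This is a minimal illustration, not a description of any particular clusterware: the function name `restart_with_failover`, the `start_instance` callback, the `local_retries` parameter, and the policy of trying the remaining nodes in list order are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def restart_with_failover(
    nodes: Sequence[str],
    failed_node: str,
    start_instance: Callable[[str], bool],
    local_retries: int = 1,
) -> Optional[str]:
    """Return the node on which a new instance started, or None if none did.

    start_instance(node) is a hypothetical callback that tries to start an
    application instance on the given node and reports success as a bool.
    """
    # Step 1: try to restart locally, on the node where the instance failed.
    for _ in range(local_retries):
        if start_instance(failed_node):
            return failed_node

    # Step 2: the local restart did not work, so fail over to another node.
    # Trying the other nodes in list order is an assumption of this sketch.
    for node in nodes:
        if node != failed_node and start_instance(node):
            return node

    # No node could host the application.
    return None


if __name__ == "__main__":
    healthy = {"node2"}  # hypothetical: only node2 can start the application
    chosen = restart_with_failover(
        ["node1", "node2", "node3"], "node1", lambda n: n in healthy
    )
    print(chosen)
```

In this usage example the instance failed on `node1`, the local restart fails, and the application is failed over to `node2`.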
Failover is useful if restarting an application (or rather, starting a new application instance of the application) on a particular node fails repeatedly because of some persistent problem afflicting the application on that particular node (the persistent problem may be a node level problem afflicting all applications on the node), while the same problem does not exist on a different node. This type of persistent problem may occur, for example, when the particular node does not have sufficient local system resources required by an application instance of the application. Since the local system resources are local to each node, the different node may very well have sufficient local system resources required by the application. Thus, failing over the application from the particular node to the different node, in the form of starting a new application instance on the different node, may work around the node level problem (e.g., lack of local system resources) that afflicted the application on the particular node.
However, an application instance of an application sometimes cannot be started on any of the nodes in the system because of a cluster level problem afflicting all the nodes. For instance, configuration parameters in configuration files for the application on all the nodes may contain the same fatal error. As a result, the application cannot be started on any of the nodes in the system. Under these circumstances, if the system were to blindly apply the previously described failover procedure, the application would be needlessly and hopelessly failed over from one node to another, resulting only in thrashing, in which one failure is immediately followed by another, repeatedly. The thrashing would waste system resources without improving the availability of the application.
To avoid such a problem, under these techniques, the number of failover attempts for any particular application in the clustered computing system is typically bounded (or capped). For example, a particular application may be allowed to attempt at most N failovers, say 5, within a failover interval, say one hour. Each time a failover event relating to the particular application occurs, an event record is written to an event log. Such an event log is typically kept on disk, and stores at least the records of all events that occurred within the failover interval. Thus, when a new failover event occurs for an application that has failed to be restarted on a node (i.e., an application instance of the application cannot be successfully started on the node even after a number of retries), a decision maker, which may be in the form of a daemon located on one of the nodes in the system, may retrieve a sufficient number of event records from the event log, determine how many failovers have been attempted for the application within the failover interval, and, based on that information, further determine whether another failover should be attempted for the application.
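The decision described above can be sketched as a count of an application's failover records inside the failover interval. This is only an illustrative sketch: the `FailoverEvent` record, the `may_fail_over` function, the in-memory list standing in for the on-disk event log, and the treatment of the interval boundary are all assumptions; the defaults of 5 failovers and one hour are the example values from the text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FailoverEvent:
    """Hypothetical event record: which application failed over, and when."""
    app: str
    timestamp: float  # seconds since some fixed epoch


def may_fail_over(
    events: Iterable[FailoverEvent],
    app: str,
    now: float,
    max_failovers: int = 5,            # N in the text
    interval_seconds: float = 3600.0,  # failover interval: one hour
) -> bool:
    """Return True if another failover is still allowed for `app`."""
    window_start = now - interval_seconds
    # Count only this application's failovers inside the interval.
    # Assumption: an event exactly `interval_seconds` old is outside the window.
    recent = sum(
        1
        for e in events
        if e.app == app and window_start < e.timestamp <= now
    )
    return recent < max_failovers


if __name__ == "__main__":
    now = 10_000.0
    # Five failovers of "db" in the last five minutes, one of "web".
    log = [FailoverEvent("db", now - 60.0 * i) for i in range(1, 6)]
    log.append(FailoverEvent("web", now - 30.0))
    print(may_fail_over(log, "db", now))   # cap reached for "db"
    print(may_fail_over(log, "web", now))  # "web" is well under the cap
```

Note that this sketch examines every record in the log on each decision, so its cost grows with the total number of logged events across all applications, not only with those of the application being restarted.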
As this discussion shows, when an application needs to be restarted, access to an event log is required under these techniques. However, since many applications that require failover protection may be deployed in the system, the event log may be very large. As a result, the failed application may not be promptly restarted, because much time must first be spent examining past failure events in the event log.
Furthermore, the problem described above may be exacerbated if access to the event log is unavailable at the time when a failed application needs to be restarted. This can happen, for example, when the failed application is related to providing database services. As a result, resilience and availability of applications in such a system may be adversely impacted.
Therefore, a better mechanism for failing over applications in a clustered computing system is needed.