Modern high availability cluster servers rely on built-in locking mechanisms that are typically present in distributed systems such as cluster file systems, distributed file systems, network file systems, and so on. The locking mechanism is typically lease based, so as to deal with possible server crashes. When a service is running or executing on a server (node), the service holds an exclusive lock on resources (e.g., data files) used by that service. For example, an SQL service running or executing on a Microsoft Windows® server may hold locks on one or more SQL database files. A virtual machine (VM) running on a VMware ESXi® server (host machine) may hold a lock on a VM virtual disk file, and so on. To keep this exclusive lock on the resources, the operating system (OS) kernel on the server sends regular heartbeats to the storage (file) server that contains the resources.
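The holder's side of this scheme can be sketched as a simple renewal loop. The following is a minimal illustration, not any particular vendor's implementation; the `send_heartbeat` callable and the class name are assumptions, and the 10-second default mirrors the cadence discussed below.

```python
import threading

class LeaseHeartbeater:
    """Sketch of a lease holder that keeps its exclusive lock alive by
    sending periodic heartbeats to the storage (file) server."""

    def __init__(self, send_heartbeat, interval_s=10.0):
        # send_heartbeat is a hypothetical callable that contacts the
        # storage server to renew the lease on the locked resources.
        self._send = send_heartbeat
        self._interval = interval_s
        self._stop = threading.Event()

    def run(self):
        # Renew the lease until told to stop. If the node (or the OS
        # kernel driving this loop) crashes, the heartbeats simply
        # cease and the lease will eventually expire on the server.
        while not self._stop.wait(self._interval):
            self._send()

    def stop(self):
        self._stop.set()
```

Note that the renewal runs independently of the service itself, which matches the description above: the OS kernel, not the service, sends the heartbeats.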
If an instance of a service running or executing on a node in the cluster, or the node itself, dies or otherwise fails, the lock should expire after some time. The storage server allows another node in the cluster to break the lock after the current lock holder fails to renew the lock. The storage server typically provides a grace period, allowing for the possibility that the service has not in fact failed but rather has experienced a delay that it can recover from. Accordingly, the storage server may allow for several missed heartbeats before releasing the lock. Failover to another instance of the service (e.g., on a failover node) can only be initiated after this grace period has passed, since the failover node will not be able to acquire a lock on the resources for the failover service prior to that time.
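The storage-server side of the lease can be sketched as a table of last-heartbeat times plus a rule for when a lock may be broken. The class and method names here are illustrative assumptions; the "three missed heartbeats" policy follows the grace periods described below.

```python
class LeaseTable:
    """Sketch of the storage server's lease bookkeeping: a lock may be
    broken by another node only after the current holder has missed
    enough heartbeats (i.e., after the grace period has elapsed)."""

    def __init__(self, heartbeat_interval_s=10, allowed_misses=3):
        self._interval = heartbeat_interval_s
        self._misses = allowed_misses
        self._last_beat = {}  # resource -> time of last heartbeat

    def heartbeat(self, resource, now):
        # A renewal from the current holder resets the clock.
        self._last_beat[resource] = now

    def may_break(self, resource, now):
        # True once the holder has been silent longer than the grace
        # period (allowed_misses * heartbeat interval).
        last = self._last_beat.get(resource)
        if last is None:
            return True  # no current holder
        return (now - last) > self._interval * self._misses
```

A failover node polling `may_break` sees the delayed-but-alive case handled correctly: a late renewal inside the grace period keeps the lock intact.

```python
table = LeaseTable(heartbeat_interval_s=10, allowed_misses=3)
table.heartbeat("vm.vmdk", now=0)
table.may_break("vm.vmdk", now=25)   # False: still within the grace period
table.may_break("vm.vmdk", now=31)   # True: > 30 s of silence
```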
The grace period differs among systems. In some systems, for instance, the grace period is 35-45 seconds in duration; e.g., three heartbeats with ten seconds between heartbeats and a graceful wait time of about 15 seconds after that for good measure. In other systems, the grace period is about 15-20 seconds; e.g., three heartbeats with five seconds between heartbeats plus a shorter graceful wait time, and so on. While this delay may seem brief, a 45-second delay can be unacceptable in a high-availability system in which access times are typically measured in tens to hundreds of milliseconds.
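The arithmetic behind the two example configurations is simple: the grace period covers the allowed missed heartbeats plus the final graceful wait. The 5-second wait in the second configuration is an illustrative assumption chosen to land in the stated 15-20 second range.

```python
def grace_period_s(heartbeat_interval_s, missed_heartbeats, extra_wait_s):
    """Grace period before the storage server will release a lock:
    time covering the allowed missed heartbeats, plus a final wait."""
    return heartbeat_interval_s * missed_heartbeats + extra_wait_s

# The two configurations described above (wait times illustrative):
slow = grace_period_s(10, 3, 15)  # three 10 s heartbeats + 15 s wait -> 45 s
fast = grace_period_s(5, 3, 5)    # three 5 s heartbeats + 5 s wait -> 20 s
```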
Long as it may be, the delay is nonetheless important in order to reduce the likelihood of a false positive indication of failure. For example, the cluster can include a cluster server that monitors each of the services and/or nodes in the cluster to determine whether a service is alive (up and running). If the cluster server determines that the initial instance of a service is no longer alive (an indication of failure), it can initiate failover processing to bring up a failover instance of the service. If the indication is a false positive (false alarm) because the initial instance of the service is actually alive, that instance will still hold the lock on its resources, which prevents those resources from being opened and modified by the failover service, thus maintaining data consistency of the resources.
On the other hand, if the initial instance of the service is in fact no longer alive, then the failover service will incur a delay equal to the grace period before the storage system will release the lock and grant a lease to the failover service. The delay is further increased by the startup time the failover service needs before it can begin servicing user requests.
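Putting the two components together, the minimum user-visible outage in the genuine-failure case can be expressed as follows; the 10-second startup time is a hypothetical figure for illustration only.

```python
def failover_delay_s(grace_period_s, service_startup_s):
    """Minimum time before a failover instance can serve requests:
    the lock cannot be acquired until the grace period has passed,
    and the service then still needs its own startup time."""
    return grace_period_s + service_startup_s

# e.g., a 45 s grace period plus a hypothetical 10 s service startup
# means at least 55 s of unavailability for the user.
```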