Information drives business. Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses. Unplanned events that inhibit the availability of this data can seriously damage business operations. Additionally, any permanent data loss, from natural disaster or any other source, will likely have serious negative consequences for the continued viability of a business. Therefore, when disaster strikes, companies must be prepared to eliminate or minimize data loss and to recover quickly with usable data.
Companies have come to rely upon high-availability clusters to provide their most critical services and to store their most critical data. In general, there are different types of clusters, such as compute clusters, storage clusters, scalable clusters, and the like. High-availability clusters (also known as HA clusters or failover clusters) are computer clusters that are implemented primarily for the purpose of providing high availability of the services that the cluster provides. They operate by having redundant computers or nodes, which are used to provide service when system components fail. Normally, if a server running a particular application crashes, the application is unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware and software faults and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
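The failover sequence described above can be pictured with a minimal sketch. It is illustrative only: the names Node, Application, prepare_node, and fail_over are hypothetical and do not correspond to any particular clustering product, and mounting a file system or starting a supporting application is simulated by appending to a list.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical model of a cluster node (assumption for illustration).
    name: str
    healthy: bool = True
    mounted: list = field(default_factory=list)   # mounted file systems
    running: list = field(default_factory=list)   # running applications


@dataclass
class Application:
    # Hypothetical description of a clustered application.
    name: str
    file_systems: list    # file systems that must be mounted first
    dependencies: list    # supporting applications that must be running


def prepare_node(node, app):
    """Configure the target node before the application is started on it."""
    for fs in app.file_systems:
        if fs not in node.mounted:
            node.mounted.append(fs)      # stands in for import and mount
    for dep in app.dependencies:
        if dep not in node.running:
            node.running.append(dep)     # stands in for starting a helper


def fail_over(app, current, candidates):
    """Restart app on the first healthy candidate if current has failed."""
    if current.healthy:
        return current                   # no fault detected, nothing to do
    for node in candidates:
        if node.healthy:
            prepare_node(node, app)
            node.running.append(app.name)
            return node
    raise RuntimeError("no healthy node available for " + app.name)


# Usage: the primary has crashed, so the application moves to the standby.
primary = Node("node-a", healthy=False)
standby = Node("node-b")
db = Application("db", ["/data"], ["logger"])
new_home = fail_over(db, primary, [primary, standby])
```

In this sketch the standby node ends up with the file system mounted and the supporting application running before the application itself is started, mirroring the order of steps given in the text.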
HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites. HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, for example through multiple network connections and data storage that is multiply connected via storage area networks or Internet protocol-based storage. Additionally, HA clusters are often augmented by connecting them to multiple redundant HA clusters to provide disaster recovery options.
High availability and disaster recovery solutions strive to decrease application downtime and application data loss. In case of a disaster such as a flood, earthquake, or hurricane, the applications running in the impacted cluster should be failed over to another cluster as early as possible to ensure that business continuity is maintained. In order to facilitate fast failover of the applications, cluster failures should be detected in a timely manner.
In high availability environments involving a cluster file system (CFS), when an NFS (network file system) server (e.g., a cluster node) crashes, or when an NFS server needs to be relocated from one CFS node to another (the latter also referred to hereafter as the adoptive node), all file lock operations at the cluster file system level need to be paused until the NFS server has completed its failover. File lock operations are resumed only after the NFS server has completed its failover. When there are simultaneous failovers or cluster membership changes (due to the joining or exit of a CFS node, also referred to hereafter as cluster reconfiguration or reconfiguration), where the failovers can be due to either reconfiguration or manual migration, a problem occurs if file lock processing resumes before all of the failovers, whether due to reconfiguration or to manual migration, have completed.
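The condition that lock processing must satisfy can be illustrated with a small sketch. It is one way to picture the requirement and is not presented as the mechanism of any particular cluster file system: the class LockGate, its methods, and the failover identifiers are hypothetical names chosen for illustration. The sketch tracks every failover still in progress, regardless of its cause, and lets lock requests proceed only when none remain.

```python
import threading


class LockGate:
    """Hypothetical gate that pauses file lock processing during failovers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = set()    # identifiers of failovers still in progress

    def failover_started(self, failover_id):
        with self._cond:
            self._pending.add(failover_id)

    def failover_completed(self, failover_id):
        with self._cond:
            self._pending.discard(failover_id)
            if not self._pending:
                # Wake waiting lock requests only when every failover is done.
                self._cond.notify_all()

    def is_paused(self):
        with self._cond:
            return bool(self._pending)

    def process_lock_request(self, request, timeout=None):
        """Block until no failover is pending, then handle the request."""
        with self._cond:
            ready = self._cond.wait_for(lambda: not self._pending, timeout)
            if not ready:
                raise TimeoutError("failovers still in progress")
            return "granted:" + request


# Usage: two simultaneous failovers, one from a reconfiguration and one
# from a manual migration. Finishing only one must not resume processing.
gate = LockGate()
gate.failover_started("reconfiguration:node-2")
gate.failover_started("manual-migration:nfs-1")
gate.failover_completed("reconfiguration:node-2")
still_paused = gate.is_paused()
gate.failover_completed("manual-migration:nfs-1")
result = gate.process_lock_request("lock /export/file", timeout=1.0)
```

A single pending set is used here instead of a separate flag per cause, so that the completion of one kind of failover cannot resume lock processing while a failover of the other kind is still outstanding.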