Information drives business. Companies today rely to an unprecedented extent on online, frequently accessed, constantly changing data to run their businesses. Unplanned events that inhibit the availability of this data can seriously damage business operations. Additionally, any permanent data loss, from natural disaster or any other source, will likely have serious negative consequences for the continued viability of a business. Therefore, when disaster strikes, companies must be prepared to eliminate or minimize data loss, and recover quickly with useable data.
Companies have come to rely upon high-availability clusters to provide their most critical services and to store their most critical data. There are several types of clusters, including compute clusters, storage clusters, and scalable clusters. High-availability clusters (also known as HA clusters or failover clusters) are computer clusters implemented primarily to provide high availability of the services that the cluster offers. They operate by having redundant computers, or nodes, that provide service when system components fail. Normally, if a server running a particular application crashes, the application is unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware and software faults and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate file systems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
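The failover sequence described above can be sketched roughly as follows. This is a minimal illustration, not any particular clustering product's implementation: the `Node` and `Application` classes, the `failover` function, and the example resource names are all hypothetical, and the configuration steps are only recorded in a log rather than performed.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    # Hypothetical description of what an application needs on a node.
    name: str
    file_systems: list
    network_interfaces: list
    supporting_apps: list


@dataclass
class Node:
    # Hypothetical cluster node; `log` records the actions taken on it.
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def prepare(self, app):
        # Configure the node before the application is started on it.
        for fs in app.file_systems:
            self.log.append(f"mount {fs}")
        for nic in app.network_interfaces:
            self.log.append(f"configure {nic}")
        for dep in app.supporting_apps:
            self.log.append(f"start {dep}")

    def start(self, app):
        self.log.append(f"start {app.name}")


def failover(app, active, standbys):
    """Return the node running `app` after checking the active node."""
    if active.healthy:
        return active  # no fault detected, nothing to do
    for node in standbys:
        if node.healthy:
            node.prepare(app)  # import/mount, network, supporting apps
            node.start(app)    # restart the application on the standby
            return node
    raise RuntimeError("no healthy standby node available")


# Example: the primary has failed, so the application moves to the backup.
app = Application("db", ["/data"], ["eth0"], ["log-shipper"])
primary = Node("node-a", healthy=False)
backup = Node("node-b")
new_active = failover(app, primary, [backup])
print(new_active.name, new_active.log)
```

In this sketch the standby node is configured first and the application is started last, mirroring the order given in the text.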
HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites. HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage which is multiply connected via storage area networks or Internet protocol-based storage. Additionally, HA clusters are often augmented by connecting them to multiple redundant HA clusters to provide disaster recovery options.
In multi-node clustering, the disks/logical unit numbers (LUNs) are shared across the nodes to provide data availability and to provide multi-point access for improved performance. In a clustered configuration, a path failure to a LUN triggers a cluster-wide input/output (I/O) failover protocol to choose the best common paths on all of the nodes of the cluster. With a large cluster (e.g., with 32 nodes) and a large number of LUNs (e.g., with 4000 LUNs) on the system, a path failure triggers the protocol, in which each node of the cluster chooses the best available path for the LUN. Thus, if one path to all 4000 LUNs fails, the nodes of the cluster exchange a large number of network messages among themselves for each path. The resulting traffic has a negative impact on failover performance because the large number of messages imposes very high central processing unit (CPU) usage during this protocol activity.
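To give a rough sense of scale, the following sketch estimates the message volume for the example figures above. The text does not specify the protocol's actual message pattern, so the model here is purely an assumption: for each affected LUN, every node sends one message to every other node. The function name `failover_message_count` and its parameters are hypothetical.

```python
def failover_message_count(nodes, luns, messages_per_pair=1):
    """Estimate messages exchanged when one path to every LUN fails.

    Assumption (not taken from the source): for each affected LUN, every
    node sends `messages_per_pair` messages to every other node.
    """
    if nodes < 1 or luns < 0 or messages_per_pair < 0:
        raise ValueError("nodes must be >= 1; luns and messages_per_pair must be >= 0")
    # nodes * (nodes - 1) ordered sender/receiver pairs, repeated per LUN.
    return luns * nodes * (nodes - 1) * messages_per_pair


# The 32-node, 4000-LUN example from the text, under the assumed model.
total = failover_message_count(nodes=32, luns=4000)
print(total)  # 3968000
```

Under this assumed all-to-all model, the count grows linearly with the number of LUNs and quadratically with the number of nodes, which is consistent with the high CPU usage described.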