1. Field of the Invention
Embodiments of the present invention generally relate to computer clusters. More particularly, embodiments of the present invention relate to a method and apparatus for efficiently handling a network partition of a computer cluster using coordination point devices to enhance split brain arbitration.
2. Description of the Related Art
A computer cluster is a configuration that comprises a plurality of computers (e.g., server computers, client computers and/or the like) as well as various cluster resources (e.g., applications, services, file systems, storage resources, processing resources, networking resources and/or the like). The various cluster resources may be managed through one or more service groups. Occasionally, the computer cluster must handle one or more split brain scenarios amongst the various cluster resources.
Generally, a split brain occurs when two independent systems (e.g., one or more computers and a portion of the various cluster resources) configured in the computer cluster each assume exclusive access to a particular cluster resource (e.g., a file system or volume). The most serious problem caused by the network partition is the effect on the data stored in the shared storage (e.g., disks). Typical failover management software uses a predefined method (e.g., heartbeats) to determine if a node is “alive”. If the node is alive, the system recognizes that the portion of the cluster resources cannot be safely taken over or controlled. During a network partition, however, the heartbeats may be lost even though each node is still running, so each system may wrongly conclude that the other has failed. As a result, multiple systems are online and have access to an exclusive cluster resource simultaneously, which causes numerous problems such as data corruption, data inconsistency and malicious attacks.
Hence, the split brain occurs when the predefined method to determine node failure is compromised.
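By way of illustration only, the following minimal sketch shows how a heartbeat-based liveness check can be compromised by a network partition. The `Node` class, the `HEARTBEAT_TIMEOUT` value and the `simulate_partition` function are hypothetical names introduced for this example and do not describe any particular failover management product.

```python
from dataclasses import dataclass, field

# Hypothetical timeout in seconds; real products choose their own value.
HEARTBEAT_TIMEOUT = 3.0


@dataclass
class Node:
    name: str
    # Maps a peer's name to the time its last heartbeat was received.
    last_seen: dict = field(default_factory=dict)

    def receive_heartbeat(self, peer: str, now: float) -> None:
        self.last_seen[peer] = now

    def believes_alive(self, peer: str, now: float) -> bool:
        last = self.last_seen.get(peer)
        return last is not None and now - last <= HEARTBEAT_TIMEOUT


def simulate_partition() -> tuple:
    a, b = Node("A"), Node("B")
    # Healthy cluster: both nodes exchange heartbeats at t = 0.
    a.receive_heartbeat("B", 0.0)
    b.receive_heartbeat("A", 0.0)
    # The network link is cut after t = 0. Both nodes keep running,
    # but no further heartbeats arrive on either side.
    now = 10.0
    return (a.believes_alive("B", now), b.believes_alive("A", now))
```

In this sketch both nodes remain running, yet each concludes that its peer is no longer alive, which is the condition under which each may attempt to take over the same exclusive cluster resource.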
One methodology for avoiding and/or mitigating split brain scenarios within the computer cluster is Input/Output (I/O) fencing. I/O fencing may also be known as disk fencing or failure fencing. When a computer cluster node fails, the failed node needs to be fenced off from the cluster resources (e.g., shared disk devices, disk groups or volumes). The main functions of I/O fencing include preventing updates by failed instances and preventing split brains in the cluster. I/O fencing may be executed by the VERITAS Cluster Volume Manager (CVM) in association with a shared storage unit. Furthermore, VERITAS Cluster File System (CFS) plays a significant role in preventing the failed nodes from accessing shared cluster resources (e.g., shared devices).
For example, in a SUN computer cluster, I/O fencing is done through Small Computer System Interface-2 (SCSI-2) Reservation for dual hosted SCSI devices and SCSI-3 Persistent Reservation (PR) for a multi-hosted SCSI environment. VERITAS Advance Cluster uses SCSI-3 PR to perform I/O fencing. In the case of LINUX clusters, cluster file systems (e.g., POLYSERVE and SISTINA Global File System (GFS)) are employed to perform I/O fencing by using different methods such as fabric fencing, which uses Storage Area Network (SAN) access control mechanisms. SCSI-2 Reservation and SCSI-3 PR are inefficient and limited in various respects. SCSI-2 Reservation is not persistent and SCSI-3 PR is limited to one computer cluster for arbitration. Furthermore, SCSI-3 PR in Campus Clusters requires a third coordinator disk to be placed on a third site, which requires expensive SAN infrastructure on the third site.
Alternatively, coordination point devices (e.g., servers) may perform I/O fencing without the above limitations of SCSI-3 PR. Generally, the coordination point devices use an arbitration technique to determine which computer cluster node to fence from the cluster resources. Unfortunately, the coordination point devices are unable to handle scenarios where a coordination point device fails or is inaccessible (e.g., cannot connect with a client computer). Furthermore, the coordination point devices may choose a smaller sub-cluster of nodes to survive over a larger sub-cluster of nodes simply because the smaller sub-cluster started an arbitration race before the larger sub-cluster. In such a case, the arbitration request initiated by the smaller sub-cluster reaches the coordination point server before the arbitration request of the larger sub-cluster.
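By way of illustration only, the arbitration race described above may be sketched as follows, under the assumption that the coordination point simply grants survival to whichever request arrives first. The function name `first_arrival_winner` and the request tuple layout (arrival time, sub-cluster name, node count) are hypothetical and introduced solely for this example.

```python
def first_arrival_winner(requests):
    """Return the request with the earliest arrival time.

    Each request is a tuple of (arrival_time, subcluster_name, node_count).
    The node count is deliberately ignored, mirroring the limitation
    described above.
    """
    return min(requests, key=lambda request: request[0])


# The one-node sub-cluster's request arrives slightly before the
# three-node sub-cluster's request.
requests = [
    (0.002, "large", 3),
    (0.001, "small", 1),
]
winner = first_arrival_winner(requests)
```

Here `winner` refers to the one-node sub-cluster, even though the three-node sub-cluster would preserve more of the computer cluster.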
Therefore, there is a need in the art for a method and apparatus for partitioning a computer cluster through coordination point devices that provide enhanced split brain arbitration during I/O fencing.