Various forms of network data storage systems are known today. These forms include network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data, backing up critical data (e.g., by data mirroring), etc.
A network storage system typically includes at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more client processing systems (“clients”). Some network storage systems may include two clustered storage servers, such as in a cluster failover system configuration. A clustered system includes two or more nodes, with attached storage, and a cluster interconnect. In accordance with failover methodologies, should a conventional storage server in a cluster failover system configuration fail, a partner storage server initiates a takeover of the volume(s) that are normally serviced by the failed storage server. When a taken-over storage server reboots, the cluster failover system typically has two fencing mechanisms that prevent that node from booting all the way up and trying to serve data. First, disk reservations can be placed on the storage devices associated with a failed storage server by the partner storage server to prevent access to the storage devices by the failed storage server. In particular, the takeover node places disk reservations on the storage devices by issuing a command to the storage devices. The disk reservations are configured to indicate ownership of data access control of the data on the storage devices. Second, takeover state information can be written to an on-disk area known to the clustering code in the storage devices associated with the clustered node(s). The on-disk area may be a disk containing the cluster information. This on-disk area, which includes the clustering information and the takeover state information, is referred to herein as mailbox disks in storage server cluster applications. The contents of the mailbox disks tell the failed partner node that it has been taken over.
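The two fencing mechanisms described above can be illustrated with a minimal sketch. All names here (`Disk`, `Mailbox`, `take_over`) are hypothetical and chosen only for illustration; real systems place reservations by issuing commands to the storage devices rather than by setting a field in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Disk:
    """Hypothetical stand-in for a storage device."""
    name: str
    reserved_by: Optional[str] = None  # node holding the reservation, if any


@dataclass
class Mailbox:
    """Hypothetical stand-in for the on-disk takeover state (mailbox disks)."""
    # Maps a taken-over node to the partner node that took it over.
    taken_over_by: Dict[str, str] = field(default_factory=dict)


def take_over(partner: str, failed: str, disks: List[Disk], mailbox: Mailbox) -> None:
    """Apply both fences on behalf of the takeover node."""
    # Fence 1: reserve every storage device associated with the failed node.
    for disk in disks:
        disk.reserved_by = partner
    # Fence 2: record the takeover in the mailbox area.
    mailbox.taken_over_by[failed] = partner


disks = [Disk("d0"), Disk("d1")]
mailbox = Mailbox()
take_over(partner="node_b", failed="node_a", disks=disks, mailbox=mailbox)
```

After `take_over` runs in this sketch, both the reservations and the mailbox record indicate that `node_a` must not serve data.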
When the failed partner node reboots after being taken over, it first encounters the disk reservations, goes into a waiting state, and waits for the partner node to give back control of the data. When the disk reservations are cleared, the failed partner node reads the contents of the clustering disk area (e.g., mailbox disks). From that data, the failed partner node determines that it is still taken over. However, there are situations where the failed partner node does not realize it has been taken over, and thus incorrectly proceeds with booting and attempts to access and serve the same data as the partner node. This is referred to as a split-brain conflict. The split-brain conflict may cause data corruption because both storage servers take ownership of the same data. Another problem that can result from the two nodes attempting to access and serve the same data is that the takeover node may fail and stop serving data, resulting in two failed nodes.
Conventionally, when the failed node reboots, it sees the disk reservations and goes into the waiting state, waiting for the partner node to give back control until the reservations have cleared. If for some reason the failed node gets past this check, such as through early-release of the disk reservations or by passing the check incorrectly, the on-disk area that includes the clustering information and the takeover state information (e.g., mailbox disks) should still indicate that this node has been taken over, and the node goes into a wait state, such as a mailbox wait state. However, the following conditions describe a situation where these two fencing mechanisms may not be adequate.
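The conventional boot-time checks described above can be sketched as a two-stage decision. This is an illustrative sketch only; `BootState` and `boot_decision` are hypothetical names, and the inputs are simplified to a reservation holder and a single mailbox flag.

```python
from enum import Enum
from typing import Optional


class BootState(Enum):
    WAITING_FOR_GIVEBACK = "waiting for giveback"
    MAILBOX_WAIT = "mailbox wait"
    CONTINUE_BOOT = "continue boot"


def boot_decision(node: str,
                  reservation_holder: Optional[str],
                  mailbox_taken_over: bool) -> BootState:
    """Return the state a rebooting node enters under the two fences."""
    # Fence 1: another node holds reservations on this node's storage devices.
    if reservation_holder is not None and reservation_holder != node:
        return BootState.WAITING_FOR_GIVEBACK
    # Fence 2: the mailbox contents say this node is still taken over.
    if mailbox_taken_over:
        return BootState.MAILBOX_WAIT
    # Neither fence applies, so the node proceeds to boot and serve data.
    return BootState.CONTINUE_BOOT
```

In this sketch, a node whose reservations were released early is still held in the mailbox wait state as long as the mailbox flag is read correctly.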
The first condition is when the failed node is booting up and goes into the waiting for giveback state, and the partner node, seeing that the failed node is in the waiting for giveback state, releases the disk reservations on the storage devices associated with the failed node, allowing the failed node to boot further. This helps reduce the wait time for the process of giving back control to the failed node. The release of disk reservations before the node continues to boot is called early-release of disk reservations. The second condition is when the storage devices containing the cluster information are discovered late. Storage devices can be discovered late because they are slow to spin up and go online, or because storage loops are offline or otherwise inaccessible. Sometimes the disk discovery mechanism has problems, and not all the storage devices are discovered in the first pass of searching for the storage devices that contain the cluster information. If the storage devices containing the cluster information (e.g., mailbox disks) are not part of the first set of disks, the booting node attempts to find an alternate on-disk area, which may include outdated clustering information and takeover state information that does not indicate that the node was taken over. Upon failure to find such an alternate on-disk area, the booting node may create new cluster state information, which likewise does not contain information about the node being taken over.
The split-brain conflict, in which both storage servers take ownership of the same data, occurs when the two conditions described above occur together, namely, the disk reservations have already been released and the on-disk area that includes the clustering information and the takeover state information (e.g., mailbox disks) is not found. Together, these two conditions allow the failed node to boot, leading to a split-brain conflict, which can cause the node in takeover to fail, making data unavailable, and can potentially cause other problems.
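The combination of conditions described above can be enumerated in a short sketch. It assumes the node has in fact been taken over, and the function name `failed_node_boots` is hypothetical. The sketch only models the logic of the two fences, not any actual disk discovery or reservation handling.

```python
from itertools import product


def failed_node_boots(reservations_released: bool, mailbox_found: bool) -> bool:
    """Return True if a taken-over node would incorrectly finish booting."""
    if not reservations_released:
        # Fence 1 holds: the node stays in the waiting for giveback state.
        return False
    if mailbox_found:
        # Fence 2 holds: the mailbox says the node is still taken over.
        return False
    # Reservations were released early and the mailbox disks were not found,
    # so the node falls back to outdated or newly created cluster state that
    # has no record of the takeover, and it proceeds to boot.
    return True


# Enumerate all four combinations of the two conditions.
outcomes = {
    (released, found): failed_node_boots(released, found)
    for released, found in product([False, True], repeat=2)
}
```

Only the combination of early-released reservations and undiscovered mailbox disks lets the node boot in this sketch; either fence alone is enough to stop it.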