The present invention is generally directed to multi-node data processing systems which share access to at least one disk storage unit, or other form of nonvolatile memory, such as a rotating optical medium. More particularly, the present invention is directed to multi-node data processing systems in which groups of processor nodes are established for the purpose of carrying out various tasks. Even more particularly, the present invention is directed to a method and system for providing nonconcurrent shared disk recovery in a two-node data processing system when a quorum of nodes is not present.
The Recoverable Virtual Shared Disk (RVSD) product (marketed and sold by International Business Machines Corporation, the assignee of the present invention) provides nonconcurrent virtual shared disk access and recovery. As used herein, “nonconcurrent” means that access to the same disk is not granted simultaneously to two different nodes. The present invention is specifically directed to the situation in which two nodes are present. In such cases, one of the nodes is designated as the primary server for managing access to shared disks which contain data. When the primary disk server fails, the backup disk server automatically and transparently takes over control of disk access management, thus allowing a shared disk application such as the IBM General Parallel File System (GPFS) to continue to run uninterrupted. The Recoverable Virtual Shared Disk product implements this recovery using Group Services, a component of the Reliable Scalable Cluster Technology (RSCT) present in the assignee's pSeries line of data processing products, to monitor a group of networked nodes for node failure.
The quorum concept is employed in multi-node data processing systems to handle a network partition such as might occur as the result of a communication failure. In a data processing system having n nodes, a quorum sufficient for further system operation is typically set at n/2+1, so that in the case of a network partition, the node group that forms with the majority of nodes stays “up” and the other group is deactivated. This provides a consistent recovery state so that only one server attempts to take over the shared disks. However, applying the same quorum value and algorithm to a system having only two nodes results in a quorum of two, which implies that either both nodes stay up or both nodes go down. This is not an acceptable choice. Thus, one cannot in general use the quorum concept as part of a recovery method if there are only two nodes. Without a quorum, when there is a node failure notification in a two-node system, one cannot determine whether the other node has failed or whether there has been a network partition.
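The quorum arithmetic described above may be illustrated by the following minimal sketch. It is provided for explanation only and is not taken from the RVSD product or from Group Services; the function names `quorum` and `surviving_partitions` are hypothetical, and the sketch assumes that n/2 denotes integer division, so that the quorum is the smallest majority of the n nodes.

```python
from __future__ import annotations


def quorum(n: int) -> int:
    """Return n/2 + 1 using integer division (assumed reading of the formula)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n // 2 + 1


def surviving_partitions(partition_sizes: list[int]) -> list[bool]:
    """For each partition of the system, report whether it holds a quorum.

    The total node count is taken to be the sum of the partition sizes,
    i.e. every node is assumed to belong to exactly one partition.
    """
    n = sum(partition_sizes)
    q = quorum(n)
    return [size >= q for size in partition_sizes]


if __name__ == "__main__":
    # Five nodes split 3/2: only the three-node group stays up.
    print(surviving_partitions([3, 2]))
    # Two nodes split 1/1: the quorum is two, so neither node stays up.
    print(surviving_partitions([1, 1]))
```

Under these assumptions, a five-node system partitioned into groups of three and two leaves exactly one group with a quorum, whereas a two-node system partitioned into two single nodes leaves none, which is the difficulty the preceding paragraph describes.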
This problem has been solved in the past by requiring a third node to act as a tiebreaker, but the resulting system is then no longer a true two-node system. The present invention solves the problem without requiring the use of a third node.