The present disclosure relates generally to controlling access to files in a disk subsystem and, in particular, to limiting access to files.
A file system is a computer program that allows other application programs to store and retrieve data on media, such as disk drives. A file structure is the organization of data on the disk drives. A shared disk file system is one in which a file structure residing on one or more disks is accessible by multiple file systems running on shared computers or nodes which may include one or more data processing units. A multi-node file system is one that allows shared access to files that span multiple disk drives on multiple nodes or clusters of nodes.
In a multi-node file system, such as IBM's General Parallel File System (GPFS), one node in a cluster of nodes is designated as the manager of a file system. This node controls functions for that file system and is commonly referred to as the file system manager (fsmgr).
Occasionally, a node that is using the file system may fail. This may occur, for example, if the node's lease for its network connection is not renewed. When a node that is using the file system crashes or is declared dead due to loss of network connection, the fsmgr has to perform recovery actions to handle the failure of the node. These actions include ensuring that the failed node will not perform any I/O operations after the rest of the nodes in the cluster recognize that it has failed. Ensuring that the failed node cannot perform any I/O operations is achieved by “fencing” the failed node off.
Traditionally, the fsmgr handles fencing of the failed node by issuing fencing calls to the logical subsystem or disk drives to “fence off” the failed node from accessing the disks in the disk subsystem. This process of fencing rests on an inherent assumption that the logical disks or partitions in the disk subsystem are accessible by the file system manager. However, in some cases, the file system manager may not have access to all the disks in the subsystem, e.g., in the event of a path failure. In this case, the fencing command may not reach all the disks. This results in a fencing failure, which may result in corrupted data on the disk subsystem and, in turn, in application failure and loss of availability of the data in the disk subsystem. To prevent such corruption, the file system is unmounted on all nodes in the cluster. The issue becomes more of a problem in heterogeneous environments (e.g., mixed AIX/Linux/x86/ppc64 clusters), where the fsmgr may not have access to a disk or may lose access to disks. In such cases, where the fsmgr cannot directly issue fencing calls, there needs to be a way to handle fencing of a failed node.
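By way of illustration only, the traditional fencing flow and its failure mode may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular file system: the names Disk, fence_failed_node and recover, the reachable_from_fsmgr flag, and the returned strings are all hypothetical and are introduced solely for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    """Hypothetical stand-in for a logical disk in the disk subsystem."""
    name: str
    reachable_from_fsmgr: bool = True  # False models, e.g., a path failure
    fenced_nodes: set = field(default_factory=set)

    def fence(self, node: str) -> None:
        # Placeholder for a real fencing call; the actual mechanism is
        # not specified here and is assumed for illustration.
        self.fenced_nodes.add(node)


def fence_failed_node(disks, failed_node):
    """Issue a fencing call from the fsmgr to every disk it can reach.

    Returns the names of the disks the fsmgr could not reach.
    """
    unreachable = []
    for disk in disks:
        if disk.reachable_from_fsmgr:
            disk.fence(failed_node)
        else:
            unreachable.append(disk.name)
    return unreachable


def recover(disks, failed_node):
    """Traditional recovery: fence everywhere, or unmount everywhere."""
    unreachable = fence_failed_node(disks, failed_node)
    if unreachable:
        # Fencing failure: the failed node could still write to these
        # disks, so the file system is unmounted on all nodes.
        return "unmount_all"
    return "fenced"
```

In this sketch, a single disk that the fsmgr cannot reach is enough to force the unmount branch, even though the remaining disks were fenced successfully.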
There is thus a need for fencing a failed node to limit access by the failed node even in those cases in which the file system manager is unable to directly issue fencing calls to the disk subsystem.