The present invention is generally directed to a system and method for controlling access to one or more files in a storage area network, or similar system, which are to be accessed from a number of independent processor nodes. More particularly, the present invention is directed to a method for limiting access to files for a relatively short duration when node failures occur, one of the consequences of node failure being the partitioning of the nodes into a plurality of partitions. Even more particularly, the method of the present invention responds to a division of the nodes into a plurality of distinct groups of nodes (partitions), each of which is subsequently associated with a quorum value, so as to provide a process which ensures the continued presence of reliable agreement on the grant of access to one or more of the files and/or disk drives attached to the system in a shared file structure. The duration of the access limitation is controlled, at least in part, through the utilization by each node of time limited access grants which are periodically renewed, but which are still required to be active (unexpired) for file access. This limited time access, referred to herein as a lease mechanism, is granted by a lease manager running on one of the nodes. Logging operations that have occurred prior to node failure are used in a recovery process to provide continuing operations without data corruption. In the present invention, operations are permitted to continue even in the event that there is a failure of the node on which the lease manager is running. Furthermore, the present invention prohibits access for as small a time as is reasonably practical.
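By way of illustration only, the lease mechanism described above may be sketched as follows. This is a minimal sketch and not the implementation of the invention: the names `Lease` and `may_access`, the use of seconds as the time unit, and the choice to restart the lease period at the moment of renewal are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Time-limited access grant held by one node (hypothetical structure)."""
    granted_at: float  # time at which the lease was granted or last renewed
    duration: float    # lease duration, in seconds (assumed unit)

    def expires_at(self) -> float:
        return self.granted_at + self.duration

    def is_active(self, now: float) -> bool:
        # A lease is active only strictly before its expiry time.
        return now < self.expires_at()

    def renew(self, now: float) -> None:
        # Assumption: a renewal restarts the lease period at the renewal time.
        self.granted_at = now


def may_access(lease: Optional[Lease], now: float) -> bool:
    """A node may access shared files only while it holds an unexpired lease."""
    return lease is not None and lease.is_active(now)
```

In this sketch a node that misses its periodic renewal, for example because it can no longer reach the lease manager, simply finds `may_access` returning false once the lease period has elapsed, and is expected to stop issuing file operations from that point.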
A proper understanding of the operation of the present invention is best grasped from an appreciation of the environment in which it operates. Accordingly, a description of such an environment is now provided. Further useful background information is found in U.S. Pat. No. 6,032,216 ("Parallel File System with Method Using Tokens for Locking Modes" filed Jul. 11, 1997 and assigned to the same assignee as the present application) which is hereby incorporated herein by reference.
Shared disk file systems allow concurrent access to data contained on disks attached by some form of Storage Area Network (SAN). SANs provide physical level access to the data on the disk to a number of systems. The shared disks are often split into partitions which provide a shared pool of physical storage but which do not inherently have common access. In order to provide such access a shared disk file system or database manager is required. Coherent access to all of the data from all of the processor nodes is provided by the SAN. IBM's GPFS (General Parallel File System) is a file system which manages a pool of disks and disk partitions across a number of systems. GPFS allows high speed direct access from any system and provides performance levels across a single file system which exceed those available from any file system managed from a single processor system node.
In the GPFS shared disk file system, each node (with each node having one or more data processors) has independent access to the disks. Consistency of data and metadata (to be described more particularly below) is maintained through the use of a distributed lock manager (or token manager). A problem occurs when one of the processors fails (due to either software or to hardware problems), leading to the loss of the processor and/or to the loss of communications capability which is needed to participate in lock management (via the lock management protocol in place within the system). Therefore, a need exists for recovery mechanisms to allow all surviving processors to execute safely using the shared disks and to allow any failed processor to return to a known state. For further background information in this regard see "Parallel File System and Method for Independent Metadata Logging" (U.S. Pat. No. 6,021,508 filed Jul. 11, 1997 and assigned to the same assignee as the present application).
The recovery model described in many systems of this kind assumes the existence of the capability for blocking access from a given processor to a given disk so that the disk subsystem ceases to honor disk requests from the failed processor. The failed processor in such systems will not be able to access the shared disks (that is, it is "fenced off"), even if it has not yet detected the communication failure. Often this fencing capability is provided by the disk hardware support (for example, via the SCSI persistent reserve protocol or via the Serial Storage Architecture (SSA) fencing operation) but there exist disk drivers where such capability is not available. In these cases, a software method has to be provided as a mechanism for blocking failed processors from improperly accessing shared file and/or data resources. The present invention fulfills this need.
In the context of shared disk parallel file systems, the present invention is targeted at mechanisms for fencing off a failed node from accessing the shared disks. If one of the nodes fails or can no longer participate in the consistency protocol, one of the purposes of the present invention is to make sure that it will no longer write to any of the disks before log recovery for that node is initiated.
In accordance with a preferred embodiment of the present invention there is provided a method for controlling access to files in a storage area network or similar arrangement in which a plurality of data processing nodes seek access to common files stored under control of other data processing nodes. In the event of a node failure, the nodes are partitioned into a plurality of distinct groups of nodes, also referred to herein as partitions within a node set. For each partition the number of nodes that constitute a quorum is determined. The time at which node failure is reported to each partition is also noted. As long as a node has an unexpired "lease" (permission) to access a given file and no node failures have occurred, file access is granted and data is either read or written, or both. Additionally, grant of access also implies the possibility of modifying metadata which is associated with the file. The grant of access is determined from within a partition in which a quorum of nodes is available and in which these nodes agree. Based upon (1) the times that node failure is reported to the different partitions, (2) the duration, D, granted for the "lease" and (3) the preferable use of a system safety margin time value, M, further access to the file from the failed node is prohibited. Put another way, the failed node is thus "fenced off" from access to the file for a time sufficient for proper recovery operations to be performed based on transaction logs that are maintained for this purpose. This assures not only that no corruption of data occurs, but that the system recovers in as little time as possible. It also provides a program solution which extends across various disk hardware platforms.
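The interplay of quorum, lease duration D and safety margin M may be illustrated by the following minimal sketch. It is offered under stated assumptions and is not the claimed method: the function names are hypothetical, a quorum is assumed to be a strict majority of the full node set, and the earliest time at which log recovery may safely begin is assumed to be the failure report time plus D plus M, on the reasoning that the failed node cannot have renewed its lease after the failure was reported.

```python
def quorum_size(total_nodes: int) -> int:
    """Assumed rule: a quorum is a strict majority of the full node set."""
    return total_nodes // 2 + 1


def has_quorum(partition_size: int, total_nodes: int) -> bool:
    """True if a partition holds enough nodes to decide on access grants."""
    return partition_size >= quorum_size(total_nodes)


def earliest_safe_recovery_time(failure_reported_at: float,
                                lease_duration: float,
                                safety_margin: float) -> float:
    """Assumed formula: report time + D + M.

    Any lease held by the failed node was granted no later than the time the
    failure was reported, so it has expired D later; M is extra slack.
    """
    return failure_reported_at + lease_duration + safety_margin


def may_start_log_recovery(now: float, partition_size: int, total_nodes: int,
                           failure_reported_at: float, lease_duration: float,
                           safety_margin: float) -> bool:
    """Recovery proceeds only in a quorum partition, after the fence interval."""
    return (has_quorum(partition_size, total_nodes)
            and now >= earliest_safe_recovery_time(
                failure_reported_at, lease_duration, safety_margin))
```

Under these assumptions, at most one partition of a divided node set can hold a quorum, and the interval during which access is withheld is bounded by D + M, which is consistent with the stated aim of recovering in as little time as possible.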
It is also noted that while the method and system of the present invention is particularly described herein in terms of disk drives or DASD storage units, the invention is equally applicable to optical storage media and even to systems of tape or cartridge storage. In fact, the nature of the information storage medium is essentially irrelevant other than that it possess a file structure.
Accordingly, it is an object of the present invention to provide enhanced protection against corruption of both data and metadata in a storage area network or similar system of distributed data processing nodes.
It is also an object of the present invention to provide storage area networks which can be constructed from disk drive systems based on several different hardware protocols for data access, such as SCSI and SSA, and which also include arrays of storage devices attached via optical fiber connections, including the Fibre Channel architecture.
It is yet another object of the present invention to meld together the concepts of node leasing and node quorums across a partitioned arrangement of nodes to assure data consistency.
It is a still further object of the present invention to ensure that recovery from node failure occurs as quickly as possible.
It is also an object of the present invention to provide an enhanced role and opportunity for data logging and recovery operations.
It is yet another object of the present invention to extend the capabilities of parallel file systems.
It is a still further object of the present invention to improve the operating characteristics of storage area networks in the face of node failures and partitioning operations.
It is also an object of the present invention to better manage a pool of disks and disk partitions across a number of systems while still permitting high speed, direct access.
It is yet another object of the present invention to protect both data and metadata from inconsistencies arising due to node failures.
Lastly, but not limited hereto, it is an object of the present invention to provide a software mechanism for blocking access from failed nodes.
The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.