Distributed file systems are traditionally built around central file servers, which manage and control access to files stored on disk. Clients send file system commands, such as create, read and write, over a network to be executed on the server. Data transfers to and from the disk pass through the server memory. Examples of distributed file systems include Sun Microsystems' Network File System (NFS™), Novell Netware™, Microsoft's Distributed File System, and IBM/Transarc's DFS™. As file systems and storage networks grow, the file server increasingly becomes a bottleneck in storage access and limits scalability of the system.
In response to this problem, new parallel-access, shared storage systems have been developed, which allow applications on multiple client nodes in a network to share storage devices and file data without mediation of a central file server as in traditional distributed file systems. These systems typically reduce server workload by distributing its functionality among other components—server cluster, clients and disks. An example of a file system for this type of shared storage is IBM's General Parallel File System (GPFS), which is a UNIX-style file system designed for IBM RS/6000 multiprocessor computing platforms. GPFS is described, for example, in a publication entitled “General Parallel File System (GPFS) 1.4 for AIX: Architecture and Performance,” which is available at www-1.ibm.com/servers/eserver/clusters/whitepapers/gpfs_aix.html. GPFS is based on a shared disk model that provides low-overhead access to disks not directly attached to the application nodes, using a cluster of file servers to provide high-speed access to the same data from all nodes of the system.
The need for a locking mechanism is common to distributed shared-storage file systems known in the art, in order to maintain atomicity of operations, and thus ensure full data coherence. In the context of the present patent application and in the claims, an operation is said to be performed atomically if, from the point of view of the system state, the operation either has been completed, effectively instantaneously, or has not occurred at all. Locking may be performed either at a file server or lock server, or at the storage devices themselves, and may be either centralized or distributed. GPFS, for example, uses a distributed, token-based locking protocol for this purpose. A token manager grants lock tokens to client nodes upon request, and revokes them when other nodes make conflicting requests. A node can read or write file data or metadata only after it obtains the proper token.
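By way of illustration only, the grant-and-revoke behavior of a token manager may be sketched as follows. The sketch is a simplified, single-process model and is not the actual GPFS protocol: the class `TokenManager`, its methods, and the two token modes `SHARED` and `EXCLUSIVE` are hypothetical names introduced here, and revocation is modeled as immediate removal of the conflicting token rather than as a message exchange with the holding node.

```python
from collections import defaultdict

# Hypothetical token modes; real systems may define finer-grained modes.
SHARED, EXCLUSIVE = "shared", "exclusive"


class TokenManager:
    """Toy token manager holding one token table per resource."""

    def __init__(self):
        self.holders = defaultdict(dict)  # resource -> {node: mode}
        self.revocations = []             # log of (node, resource) pairs

    @staticmethod
    def _conflicts(held_mode, requested_mode):
        # Two shared tokens coexist; any pairing with exclusive conflicts.
        return EXCLUSIVE in (held_mode, requested_mode)

    def request(self, node, resource, mode):
        """Grant `mode` on `resource` to `node`, revoking conflicting tokens."""
        table = self.holders[resource]
        for other, held in list(table.items()):
            if other != node and self._conflicts(held, mode):
                del table[other]
                self.revocations.append((other, resource))
        table[node] = mode
        return mode

    def may_read(self, node, resource):
        # Any token (shared or exclusive) permits reading.
        return node in self.holders[resource]

    def may_write(self, node, resource):
        # Only an exclusive token permits writing.
        return self.holders[resource].get(node) == EXCLUSIVE


tm = TokenManager()
tm.request("A", "file1", SHARED)
tm.request("B", "file1", SHARED)     # compatible with A's shared token
tm.request("C", "file1", EXCLUSIVE)  # conflicts: A and B are revoked
```

In this model every conflicting request passes through the manager, which suggests why locking traffic grows with the number of nodes contending for the same data.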
As another example, the Global File System (GFS) uses a locking mechanism maintained by the storage device controllers. GFS is described by Soltis et al., in “The Global File System,” Proceedings of the Fifth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies (College Park, Maryland, 1996), which is incorporated herein by reference. Other systems use group communication messaging protocols or lock servers. In any case, the overhead associated with locking prevents such shared-storage systems from growing beyond several hundred nodes.
Modern disks used in shared storage systems are typically independent units, with their own computational power. This computational power can be used to take over some of the functions previously performed by servers, such as allocation and protection. In this vein, object-based storage devices (OBSDs) are being developed to move low-level storage functions into the storage device itself, and thus to permit clients to access the device through a standard object interface rather than a traditional block-based interface. (Essentially, an OBSD can be constructed by layering a thin operating system on top of a conventional disk machine.) This higher-level storage abstraction enables cross-platform solutions by pushing down to the device the low-level functions that would normally be implemented differently on different platforms. Furthermore, the direct-access nature of OBSDs enables scalable, high-performance solutions, as there are no potential bottlenecks in the system between the hosts and the storage devices. The basic concepts of OBSDs (also known as OSDs) are described at www.snia.org/English/Work_Groups/OSD/index.html.
OBSDs are particularly useful in storage area networks (SANs), in which disks and clients communicate directly over a network, without intervening servers. Gibson et al., for example, describe an implementation of OBSDs for this purpose in “File Systems for Network-Attached Secure Disks” (1997), which is incorporated herein by reference. This publication is available at www.pdl.cmu.edu/publications/index.html#NASD. A network-attached secure disk (NASD) drive, like other OBSDs, stores variable-length, logical byte streams called objects. Client file systems wanting to allocate storage for a new file request one or more objects to hold the file's data. Read and write operations apply to a byte region (or multiple regions) within an object. The layout of an object on the physical media is determined by the NASD drive and is transparent to the client.
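The object interface described above may be pictured, purely as an illustrative sketch, by the following in-memory model. It is not the NASD or OSD command set: the class `ObjectStore` and its `create`, `write` and `read` methods are hypothetical names, and a Python `bytearray` stands in for whatever physical layout the drive chooses. The point of the sketch is that the client addresses an object identifier and a byte range, never a disk block.

```python
import itertools


class ObjectStore:
    """Toy object-based device: variable-length byte streams keyed by ID."""

    def __init__(self):
        self._objects = {}             # object ID -> bytearray (layout is hidden)
        self._ids = itertools.count(1)

    def create(self):
        """Allocate a new, empty object and return its identifier."""
        oid = next(self._ids)
        self._objects[oid] = bytearray()
        return oid

    def write(self, oid, offset, data):
        """Write `data` at byte `offset`, growing the object if needed."""
        buf = self._objects[oid]
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))  # zero-fill any gap
        buf[offset:end] = data
        return len(data)

    def read(self, oid, offset, length):
        """Return up to `length` bytes starting at byte `offset`."""
        return bytes(self._objects[oid][offset:offset + length])


store = ObjectStore()
oid = store.create()              # a client file system requests an object
store.write(oid, 0, b"hello world")
print(store.read(oid, 6, 5))      # reads a byte region within the object
```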
In multiprocessor operating systems, read-modify-write (RMW) operations are used to solve problems of synchronization in access to shared memory. RMW operations read data, update it and write it back to the memory atomically, so that the processor performing the operation does not need a lock to ensure data consistency. A particular RMW operation, known as load-linked store-conditional, was defined by Herlihy, in “A Methodology for Implementing Highly Concurrent Data Objects,” ACM Transactions on Programming Languages and Systems 15:5 (November, 1993), pages 745-770, which is incorporated herein by reference. To update a data structure indicated by a pointer, the processor first copies the structure into a newly allocated block of memory, makes changes to the new version, and switches the pointer to the new version if appropriate conditions are met.
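The copy, modify and conditional-switch sequence may be sketched as follows, under stated assumptions. Python has no load-linked or store-conditional primitive, so the hypothetical class `LLSCCell` emulates one with a version counter, and an internal `threading.Lock` stands in for the atomicity that hardware would provide. The names `LLSCCell`, `load_linked`, `store_conditional` and `update` are introduced here for illustration and do not come from the cited publication.

```python
import copy
import threading


class LLSCCell:
    """Emulated load-linked / store-conditional cell holding one reference."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load_linked(self):
        """Return the current value together with a version stamp."""
        with self._guard:
            return self._value, self._version

    def store_conditional(self, version, new_value):
        """Install `new_value` only if no store occurred since `version`."""
        with self._guard:
            if version != self._version:
                return False            # another update intervened
            self._value = new_value
            self._version += 1
            return True


def update(cell, mutate):
    """Copy the structure, modify the copy, and switch if still current."""
    while True:
        current, version = cell.load_linked()
        candidate = copy.deepcopy(current)  # newly allocated private copy
        mutate(candidate)                   # changes are made on the copy
        if cell.store_conditional(version, candidate):
            return candidate
        # Condition not met: discard the copy and retry from a fresh read.


def bump(state):
    state["count"] += 1


cell = LLSCCell({"count": 0})
update(cell, bump)
```

Callers of `update` hold no lock while they modify their private copy; an update that loses the race is simply retried, and readers always see either the old version or the new one in full.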