In modern computer systems, large collections of data are usually organized on disk storage as files. If the number of files is large, then the files may be distributed over multiple computer systems. Users' programs access the files by requesting file services from one or more file systems. The file systems also perform administrative actions such as controlling coherent access by the clients, communicating with physical storage components, maintaining redundant copies, and recovering from failure.
In most file systems, the files comprise user data and metadata. The metadata are all information required to manage the user data, such as names, locations, dates, file sizes, access protection, and so forth. The organization of the user data is usually managed by the client programs.
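As an illustration only, the split between metadata and user data can be sketched as follows. The class names, the field names, the store function, and the 4096-byte block size are assumptions introduced for this sketch and do not correspond to any particular file system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class FileMetadata:
    """Information the file system keeps in order to manage the user data."""
    name: str
    size: int                    # file size in bytes
    block_locations: List[int]   # hypothetical block numbers holding the data
    modified: datetime           # date of last modification
    mode: int                    # access protection bits, e.g. 0o644


@dataclass
class StoredFile:
    metadata: FileMetadata       # managed by the file system
    user_data: bytes             # opaque to the file system; organized by clients


def store(name: str, payload: bytes, mode: int = 0o644,
          block_size: int = 4096) -> StoredFile:
    """Build the metadata for a payload without interpreting the payload."""
    # Ceiling division; an empty file still occupies one block in this sketch.
    n_blocks = max(1, -(-len(payload) // block_size))
    meta = FileMetadata(
        name=name,
        size=len(payload),
        block_locations=list(range(n_blocks)),
        modified=datetime.now(timezone.utc),
        mode=mode,
    )
    return StoredFile(metadata=meta, user_data=payload)
```

In this sketch, the file system never inspects the payload; it records only what it needs to locate, size, date, and protect the file.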
It is laborious to administer a large distributed file system that serves a large and growing user community. For instance, to store more files, and to serve more users, one must add more disks and more server computers. Each of these components requires human attention. To simplify the distribution of files, groups of files or "volumes" are often manually assigned to particular disks. Then, the files can manually be moved or replicated when components fill up, fail, or become throughput bound.
Joining many thousands of files distributed over many disks into a redundant array of independent disks (RAID) is only a partial solution; administration problems still arise when the system grows so large as to require multiple RAIDs and multiple server processors.
In the prior art, there have been numerous attempts to construct distributed file systems that are scalable. Scalable in this context means that the file system can be adjusted to any desired size without changing the underlying architecture of the system. Some of these prior art file systems are now described to illustrate the need for a better scalable file system.
The Cambridge File Server (CFS), described by Birrell et al. in "A universal file server," IEEE Transactions on Software Engineering, SE-6(5):450-453, September 1980, takes a two-layered approach to building a distributed file system. There, the layers provide the users with two abstractions: files and indexes. File systems built on the two layers can use these abstractions to implement a distributed file system. As a characteristic, the CFS manages the entire distributed file system from a single server computer. Controlling data flow from a single server is simple, but in situations where a single server cannot handle the task, the CFS falls short. Also, a single server based system is vulnerable to failure.
The Network File System (NFS), as described by Sandberg et al. in "Design and implementation of the Sun network file system," Proceedings of the Summer USENIX Conference, pages 119-130, June 1985, is not a file system in itself, but rather a remote file access protocol. The NFS protocol provides a weak notion of cache coherence, and its stateless design requires client users to make many unnecessary and frequent accesses to the servers to maintain a marginal level of coherence in the data.
The Andrew File System (AFS), described by Howard et al. in "Scale and performance in a distributed file system," ACM Transactions on Computer Systems, 6(1):51-81, February 1988, and its offshoot DCE/DFS, as described by Kazar et al. in "DEcorum file system architectural overview," Proceedings of the Summer USENIX Conference, pp. 151-164, June 1990, provide better cache performance and data coherence than NFS. AFS is designed for a different kind of scalability than will be described herein. The AFS has a global name space and security architecture that allows client computers to connect to many separate file servers using a wide area network.
The Echo file system, described by Mann et al. in "A coherent distributed file cache with directory write-behind," ACM Transactions on Computer Systems, 12(2):123-164, May 1994, is log-based. The Echo file system replicates data for reliability, and access paths are allowed to span multiple disks for availability. In addition, the Echo file system provides coherent caching.
However, the Echo file system cannot easily be scaled. There, each volume can only be managed by a single server computer. Failover, in the case of a hardware failure, can only be to a predetermined backup server. A volume can only span as many disks as can be connected to a single server. Although there is an internal layering of file services on top of a disk service, the Echo file system requires both layers to execute in the same address space on the same machine.
The VMS Cluster file system, described by Strecker et al. in "VAXclusters: A closely-coupled distributed system," ACM Transactions on Computer Systems, 4(2):130-146, May 1986, off-loads file system processing to individual servers that are members of a cluster, i.e., a plurality of closely-coupled computers.
Each server in the cluster executes its own instance of the file system program in conjunction with a shared physical disk. Synchronization is provided by a distributed lock service. The shared physical disk is accessed either through a special-purpose cluster interconnect (CI) to which a disk controller can be directly connected, or through an ordinary local area network (LAN) such as Ethernet, and a processor acting as a disk server.
The Spiralog file system described by Johnson et al. in "Overview of the Spiralog file system," Digital Technical Journal, 8(2):5-14, 1996, also off-loads processing of its file system to individual members of a cluster of interconnected servers that run above a shared storage system layer.
The interface between layers in the Spiralog file system differs from that of the VMS Cluster file system in that the lower layer is neither file-like nor simply disk-like. Instead, Spiralog provides an array of stably-stored bytes, and permits atomic actions to update arbitrarily scattered sets of bytes within the array. Spiralog's split between layers simplifies the file system, but complicates the storage system considerably. Spiralog does not scale easily, nor does Spiralog tolerate hardware faults readily. A Spiralog volume can only span the disks connected to a single server, and the volume becomes unavailable when the server suffers a failure.
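The notion of an atomic action that updates scattered sets of bytes within an array can be illustrated by the following minimal sketch. It is not the Spiralog interface: the class ByteArrayStore and its methods are hypothetical, the array is held in memory rather than stably stored, and atomicity here means only that either every write in a request is applied or none is.

```python
import threading
from typing import Dict


class ByteArrayStore:
    """In-memory stand-in for an array of bytes with all-or-nothing updates."""

    def __init__(self, size: int) -> None:
        self._data = bytearray(size)
        self._lock = threading.Lock()

    def read(self, offset: int, length: int) -> bytes:
        with self._lock:
            return bytes(self._data[offset:offset + length])

    def atomic_update(self, writes: Dict[int, bytes]) -> None:
        """Apply every (offset -> bytes) write, or none of them."""
        with self._lock:
            # Validate all writes first so a bad one leaves the array untouched.
            for offset, chunk in writes.items():
                if offset < 0 or offset + len(chunk) > len(self._data):
                    raise ValueError("write at offset %d is out of range" % offset)
            # Readers are excluded by the lock until all writes are applied.
            for offset, chunk in writes.items():
                self._data[offset:offset + len(chunk)] = chunk


store = ByteArrayStore(16)
store.atomic_update({0: b"ab", 10: b"cd"})   # two scattered regions, one action
```

A storage layer offering such an operation on stable storage would additionally have to survive a crash in the middle of an update, which this sketch does not attempt and which suggests why the split complicates the storage system.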
Though designed as a cluster file system, Calypso, described by Devarakonda et al. in "Recovery in the Calypso file system," ACM Transactions on Computer Systems, 14(3):287-310, August 1996, is more similar to Echo than to the VMS Cluster file system. Like Echo, Calypso stores its files on multi-ported disks, i.e., disks that can be accessed by multiple servers. One of the servers directly connected to each disk acts as a file server for data stored on that disk; when the server fails, another server takes over. Other servers in a Calypso cluster access the current server as file system clients. Like Echo, the client computers can maintain coherent caches using a multiple-reader/single-writer locking protocol.
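The multiple-reader/single-writer rule can be illustrated by the following minimal sketch. It is a lock for threads within one process, not the distributed protocol of Calypso or Echo; the class ReadWriteLock and its method names are assumptions made for this sketch, and writer starvation is not addressed.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers, or exactly one writer, but never both."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0        # number of readers currently holding the lock
        self._writer = False     # True while a writer holds the lock

    def acquire_read(self, blocking: bool = True) -> bool:
        with self._cond:
            while self._writer:              # readers wait only for a writer
                if not blocking:
                    return False
                self._cond.wait()
            self._readers += 1
            return True

    def release_read(self) -> None:
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()      # a waiting writer may proceed

    def acquire_write(self, blocking: bool = True) -> bool:
        with self._cond:
            while self._writer or self._readers > 0:   # writer needs exclusivity
                if not blocking:
                    return False
                self._cond.wait()
            self._writer = True
            return True

    def release_write(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()          # wake waiting readers and writers
```

Under this rule, any number of holders may read the same data at once, while a holder that intends to modify the data must first obtain exclusive access, so no reader can observe a partially written state.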
Shillner et al., in "Simplifying distributed file systems using a shared logical disk," Technical Report TR-524-96, Dept. of Computer Science, Princeton University, 1996, describe a distributed file system on top of a shared logical disk. There, a lower layer uses multiple servers cooperating to implement a single logical disk. In an upper layer, multiple independent servers execute the same file system code on top of the logical disk to provide access to shared files. However, the logical disk layer does not provide redundancy. The system can recover from a failure in a local server, but dynamic reconfiguration of other failed servers is not possible.
Their file system uses careful ordering of operations that write file metadata, but the writes are not logged. Their technique avoids the need for a full metadata scan to restore consistency after a server failure. Unfortunately, the shared logical disk can lose track of free blocks after a server failure. This necessitates a time-consuming garbage collection process to locate the free blocks.
The xFS file system, described by Anderson et al. in "Serverless network file systems," ACM Transactions on Computer Systems, 14(1):41-79, February 1996, distributes management responsibility for files over multiple servers and provides good availability and performance. However, xFS has a predesignated manager for each file, and the storage server is log-structured and works independently of other servers. File system recovery and reconfiguration are not addressed.
An ideal distributed file system would provide all of its users with shared access to the same set of files. Access would be controlled in a coherent and transparent manner so that any user's view of any file at any one time is consistent with any other user's view. In addition, the distributed file system needs to be scalable to any arbitrary size to provide more storage space and higher performance as the demand for data from an ever-increasing number of users grows. The users would also like to have uninterrupted access to the data of the files, so high availability is a necessity, even though hardware components are well known to fail unpredictably at any time. In order to keep maintenance costs down, the distributed file system should require a minimal amount of human administration, and the complexity of the administration should not increase as more hardware components or users are added.