The present invention relates generally to computer file systems. More specifically, the present invention involves a distributed file system based on two technologies: shared storage and file system layering.
The term “file system” refers to the system designed to provide computer applications with access to data stored on storage devices in a logical, coherent way. File systems generally hide the details of how data is stored on a storage device from the application program. For instance, data on a storage device is generally block accessible, in that data is addressed with the smallest granularity of a block, with multiple blocks forming an extent. The size of the particular block depends upon the actual device involved. Application programs generally request data from file systems byte by byte. Consequently, file systems are responsible for seamlessly mapping between application program memory space and the storage device address space.
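The byte-to-block mapping described above can be illustrated with a minimal sketch. The function name byte_range_to_blocks and the 512-byte block size in the usage note are assumptions chosen for illustration; a real file system would additionally translate each logical block to a device address.

```python
def byte_range_to_blocks(offset, length, block_size):
    """Split a byte-granular request into per-block pieces.

    Returns a list of (block_number, start_in_block, byte_count) tuples.
    All names here are hypothetical and used only for illustration.
    """
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset and length must be >= 0, block_size > 0")
    pieces = []
    position = offset
    remaining = length
    while remaining > 0:
        block = position // block_size        # which block holds this byte
        start = position % block_size         # where the byte sits in the block
        count = min(block_size - start, remaining)
        pieces.append((block, start, count))
        position += count
        remaining -= count
    return pieces
```

For example, with an assumed block size of 512 bytes, a request for 200 bytes at offset 1000 touches the last 24 bytes of block 1 and the first 176 bytes of block 2.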
Application programs store and retrieve data from files as contiguous, randomly accessible segments of bytes. Users are responsible for organizing data stored in these files, since file systems are generally not concerned with the content of each file. With a byte-addressable address space, users may read and write data at any offset within a file. Users can grow files by writing data to the end of a file. The size of the file increases by the amount of data written. Conversely, users can truncate files by reducing the file size to a particular length.
To maximize storage efficiency, file systems place “holes” in areas within files that contain no data. Holes act as space holders between allocated sections of user data. File systems must manage holes, though no data is allocated to the holes until users write data to the location. When a user reads from a hole, the file system fills the user buffer with zeros.
A hole can either occupy space within an allocated block or occupy space of entire blocks. File systems manage block aligned holes in a manner similar to real-data blocks, yet no blocks are allocated. File systems manage holes internal to allocated blocks simply by zeroing the space of the hole.
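The treatment of holes can be sketched with a small in-memory model. The SparseFile class and its 8-byte block size in the usage note are hypothetical; the sketch only illustrates that block-aligned holes consume no blocks, that holes inside an allocated block are zero-filled, and that reads from either kind of hole return zeros.

```python
class SparseFile:
    """Toy model of a file whose unwritten regions are holes."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # logical block number -> bytearray (allocated only)
        self.size = 0

    def write(self, offset, data):
        if not data:
            return
        pos, i = offset, 0
        while i < len(data):
            blk, start = divmod(pos, self.block_size)
            count = min(self.block_size - start, len(data) - i)
            # A newly allocated block starts zeroed, so any part of it
            # that is not written remains an internal hole of zeros.
            buf = self.blocks.setdefault(blk, bytearray(self.block_size))
            buf[start:start + count] = data[i:i + count]
            pos += count
            i += count
        self.size = max(self.size, offset + len(data))

    def read(self, offset, length):
        end = min(offset + length, self.size)
        out = bytearray()
        pos = offset
        while pos < end:
            blk, start = divmod(pos, self.block_size)
            count = min(self.block_size - start, end - pos)
            buf = self.blocks.get(blk)
            if buf is None:
                out += bytes(count)            # block-aligned hole: zeros
            else:
                out += buf[start:start + count]
            pos += count
        return bytes(out)
```

Writing three bytes at offset 20 of an empty file with 8-byte blocks grows the file to 23 bytes while allocating only one block; the first two blocks remain holes.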
In addition, file systems are generally responsible for maintaining a disk cache. Caching is a technique to speed up data requests from application programs by saving frequently accessed data in solid-state memory for quick recall by the file system without having to physically retrieve the data from the storage device. Caching is also useful during file writes; the file system may write user data to cache memory and complete the request before the data is actually written to disk storage.
Additionally, file systems maintain information indicating which data blocks are available to be allocated to files. File systems modify these free lists during file allocation and de-allocation. Most modern file systems manage free lists by means of bitmap tables. File systems set bits to signify blocks that are allocated to files.
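A free list kept as a bitmap table can be sketched as follows. The BlockBitmap class and its first-fit allocation policy are assumptions made for illustration; actual file systems use a variety of allocation policies.

```python
class BlockBitmap:
    """Toy free-block bitmap: a set bit means the block is allocated."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.bits = bytearray((total_blocks + 7) // 8)

    def _check(self, block):
        if not 0 <= block < self.total_blocks:
            raise IndexError("block number out of range")

    def is_allocated(self, block):
        self._check(block)
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self):
        """Allocate the lowest-numbered free block (first fit, an assumption)."""
        for block in range(self.total_blocks):
            if not self.is_allocated(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise OSError("no free blocks")

    def free(self, block):
        if not self.is_allocated(block):
            raise ValueError("block is not allocated")
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF
```

Allocation sets a bit and de-allocation clears it, so a freed block becomes available to the next allocation request.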
File systems present data to application programs as files: contiguous, randomly accessible segments of bytes. These files, called regular files, are presented to application programs through directory files which form a tree-like hierarchy of files and subdirectories containing more files. The complete directory structure is called the file system name space. Link files are a third type of file used to provide multiple file names per physical file.
File systems are required to map this application level interface to the often non-contiguous data blocks stored on the storage device. Generally, information required to map a particular file or directory to the physical locations of the storage device is stored by the file system in an inode within a data block. Inodes contain information, called attributes, about a particular file, such as file type, ownership information, access permissions and times, and file size. Inodes also contain a list of pointers which address data blocks. These pointers may address single data blocks or address an extent of several consecutive blocks. The addressed data blocks contain either actual data or a list of other pointers. With the information specified by these pointers, the contents of a file can be read or written by an application program. When an application program writes to a file, data blocks may be allocated by the file system. Such allocation modifies the inode.
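The role of the inode can be illustrated with a simplified model that holds attributes together with extent pointers. The Extent and Inode classes and their field names are hypothetical; the sketch omits indirect pointer blocks and shows only how a logical file block is resolved to a device block.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    file_block: int     # first logical block of the file covered by the extent
    device_block: int   # first physical block on the storage device
    count: int          # number of consecutive blocks


@dataclass
class Inode:
    file_type: str
    owner: str
    mode: int
    size: int = 0
    extents: list = field(default_factory=list)

    def lookup(self, file_block):
        """Return the device block for a logical block, or None for a hole."""
        for ext in self.extents:
            if ext.file_block <= file_block < ext.file_block + ext.count:
                return ext.device_block + (file_block - ext.file_block)
        return None
```

An inode with extents covering logical blocks 0 through 3 and 6 through 7 resolves logical block 2 to the third block of the first extent and reports logical block 4 as unallocated.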
The terms meta-data and real-data classify file system structure data and user data, respectively. In other words, real-data is data that users store in regular files. Other terms for real-data include user data and file data. File systems create meta-data to store layout information, such as inodes and free block bitmap tables. Meta-data is not directly visible to users. Meta-data requires a fraction of the amount of storage space that real-data occupies and has significant locality of reference. As a result, meta-data caching drastically influences file system performance.
Meta-data consistency is vital to file system integrity. Corruption of meta-data may result in the complete destruction of the file system. Corruption of real-data may have bad consequences for users but will not affect the integrity of the whole file system.
File systems can generally be divided into two separate types. Local file systems allow computers to access files and data stored on locally attached storage devices. While local file systems have advanced significantly over the years, such file systems have limited usefulness when data needs to be shared between multiple computers. Distributed file systems have been developed in order to make shared data available to multiple computer systems over a computer network. Distributed file systems provide users and applications with transparent access to files and data from any computer connected to the file system. Distributed file system performance cannot equal local file system performance due to resource sharing and lack of data locality.
Traditional distributed file systems are based on client-server architectures. Server computers store shared data on locally attached storage devices, called server-attached devices. Clients send file system requests to server computers via networks. Early distributed file systems, such as Sun Microsystems Network File System (NFS), use a central server to store real and meta-data for the file system. These central servers locally maintain meta-data and transport only real-data to clients. The central server design is simple yet efficient, since all meta-data remains local to the server. Like local file systems, central servers only need to manage meta-data consistency between main memory and storage devices. In fact, central server distributed file systems often use local file systems to manage and store meta-data for the file system. In this regard, the only job of the central server file system is to transport real-data between client and server.
As the need grew for greater parallelism and enhanced availability, distributed file system designs evolved from central servers to multiple server configurations. As with central servers, multiple servers, also known as distributed servers, store all file system data on devices connected to server computers. Since multiple servers cooperatively manage the file system, servers may share meta-data between computers. The complexity of these designs increases by an order of magnitude, since distributed system integrity requires strong meta-data consistency between servers. Such systems cannot use local file systems to store data. As a result, server software must manage, store, and transport meta-data between servers. Two examples of distributed server file systems are the Andrew File System from Carnegie Mellon University and the Sprite File System from the University of California at Berkeley.
Distributed server file systems have further evolved into designs where clients and servers are often difficult to distinguish. In these systems, clients manage, store, and transport real-data and meta-data between servers and other clients. Coda from Carnegie Mellon University and the xFS File System from the University of California at Berkeley are two examples of merged client-server designs.
One aspect of client-server file system designs that has remained unchanged among central server, distributed server, and merged client-server designs is the local attachment of storage devices to computers. Unfortunately, this architecture has performance and availability weaknesses. With devices attached to computers, a computer failure renders data stored on the storage device inaccessible. Although redundant devices on separate computers can be added to improve availability, such a technique adds complexity and cost to the system.
Furthermore, the architecture limits performance when clients access data stored on remote devices. The data-path between client and storage device includes a server computer. This server adds overheads caused by server workload and overheads relating to storage device interface to network interface protocol translations. Server computers designed to support large workloads are very expensive.
Distributed file system designs that use shared storage, or shared disk, technologies have followed a slightly different evolution path. Instead of storing data on storage devices connected locally to computers, shared storage designs store data on devices shared between client computers. Shared storage systems have a short data-path between clients and devices.
These distributed systems require arbitration for the storage devices and consistency management of any data cached on the clients. Consistency mechanisms are either centrally located or distributed within the system. The consistency mechanisms may include software running on computers, hardware mechanisms attached to the networks, or a combination of both.
Two distinct file system designs utilize shared storage technology. The first case uses private file managers, in which client computers independently access meta-data and real-data directly from the storage devices. Private file manager schemes do not require dedicated file servers, since all necessary data is taken directly from the shared storage devices. With private file manager designs, each client views storage as locally attached. Clients only service local file requests. No direct communication is needed between clients. Such systems are often derived from modified local file systems. Examples of such systems include the Cray Research Shared File System, the Digital VAXcluster, and the Global File System from the University of Minnesota.
As a result of their designs, clients utilizing private file managers remain independent from the failures and bottlenecks of other clients. Similarly, client resources such as memory, CPUs, and bus bandwidth are not spent servicing requests from other clients. However, private file manager designs do have several disadvantages. First, the designs can only support a primitive form of caching. Clients may only access data cached locally in memory or stored on the shared devices; data cached in the memory of other clients is not accessible. The second disadvantage deals with complications encountered during recovery. Since clients are not aware of other clients, clients must indirectly determine data corruption caused by other client failures.
The second type of shared storage distributed file system design utilizes file manager server computers. These file servers manage file system directory structures and meta-data on non-shared storage devices. Clients make requests to the servers, and the servers determine the location of real-data on shared devices by retrieving and examining meta-data from the non-shared storage device. Once the location is determined, the servers either initiate transfers between clients and storage devices or inform clients how to invoke the transfer. Servers must maintain and store meta-data, manage real-data, and control transfers between clients and storage devices. These shared storage designs suffer from many of the same difficulties as client-server architectures based upon server-attached disks. The server design is complex, since servers need to provide a great deal of functionality. Servers that fail or become overworked tend to disrupt file system operation. Since this form of distributed file system differs considerably from other shared storage designs, these designs can be classified as shared file manager, shared storage systems. The HPSS/SIOF project at Livermore National Laboratories is an example that uses a shared file manager to facilitate transfers between storage servers and clients.
I/O interfaces transport data between computers and devices as well as among computers. Traditionally, interfaces fall into two categories: channels and networks. Computers generally communicate with storage devices via channel interfaces. Channels predictably transfer data with low-latency and high-bandwidth performance; however, channels span short distances and provide low connectivity. High-performance requirements often dictate that hardware mechanisms control channel operations.
Computers communicate with other computers through networks. Networks are interfaces with more flexibility than channels. Software controls a substantial portion of network operations, providing networks with flexibility but low performance.
Recent interface trends combine channel and network technologies into single interfaces capable of supporting multiple protocols. For instance, Fibre Channel (FC) is an emerging ANSI serial interface that supports channel and network operations. Fibre Channel supports traditional network protocols like Transmission Control Protocol/Internet Protocol (TCP/IP) as well as traditional channel protocols such as Small Computer System Interface (SCSI-3). Combined interfaces allow shared storage file systems to have high connectivity, connect over long distances, and operate in unpredictable environments. A new term for I/O interfaces that support shared storage is storage area network (SAN). Shared storage devices that connect to SANs are also referred to as network attached storage (NAS) devices. The term NAS device refers to extent addressable storage systems connected to a network.
File system designers can construct complete file systems by layering, or stacking, partial designs on top of existing systems. The new designs reuse existing services by inheriting functionality of lower levels. For instance, NFS is a central-server architecture that utilizes an existing local file system to store and retrieve data on a storage device attached locally to the server. By layering NFS on top of local file systems, NFS software is free from the complexities of name space, file attribute, and storage management. NFS software consists of simple caching and transport functions. As a result, NFS benefits from performance and recovery improvements made to local file systems.
Other examples of file system layering include adding quota support to an existing file system, strengthening consistency of cached data in an existing distributed file system, and adding a layer that compresses or encrypts files for a file system without such support.
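The layering technique can be illustrated with a minimal sketch of a compression layer stacked on a lower file system. The DictFileSystem and CompressionLayer classes and their whole-file read_file and write_file methods are hypothetical simplifications; a real layer would be stacked through an interface such as VFS and would operate on individual read and write requests.

```python
import zlib


class DictFileSystem:
    """Stand-in for an existing lower-level file system (hypothetical)."""

    def __init__(self):
        self.files = {}

    def write_file(self, name, data):
        self.files[name] = bytes(data)

    def read_file(self, name):
        return self.files[name]


class CompressionLayer:
    """Layer that adds compression and inherits everything else from below."""

    def __init__(self, lower):
        self.lower = lower

    def write_file(self, name, data):
        # Compress on the way down; naming and storage are left to the lower layer.
        self.lower.write_file(name, zlib.compress(data))

    def read_file(self, name):
        # Decompress on the way up, so callers see the original bytes.
        return zlib.decompress(self.lower.read_file(name))
```

The upper layer contains only the new function, in this case compression, while name space and storage management are reused unchanged from the lower layer.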
Most modern operating systems include installable file system interfaces to support multiple file system types within a single computer. In UNIX, the Virtual File System (VFS) interface is an object-oriented interface that supports various file system types within a single operating system. VFS occupies the level between the user/system call interface and installed file systems. Each installed file system provides the UNIX kernel with functions associated with VFS and vnode operations. VFS functions operate on whole file systems and perform tasks such as mounting, unmounting, and reading status. Vnode operations manipulate individual files. Vnode operations include opening, closing, creating, removing, reading, writing, and renaming files.
Vnode structures are the objects upon which vnode functions operate. A vnode is the VFS virtual equivalent of an inode. VFS creates and passes vnodes to file system vnode functions. Each vnode includes a pointer, called v_data, for file systems to attach private structures such as inodes.
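The relationship between vnodes, vnode operations, and the private v_data pointer can be sketched in simplified form. Actual VFS implementations are written in C inside the kernel and differ between platforms; the Python classes below, including the MemFsOps file system and the vfs_read and vfs_write helpers, are hypothetical and serve only to illustrate the dispatch pattern.

```python
class Vnode:
    """Virtual node: generic handle passed to file system vnode functions."""

    def __init__(self, ops, v_data):
        self.ops = ops         # vnode operations supplied by the file system
        self.v_data = v_data   # private per-file structure, e.g. an inode


class MemFsOps:
    """Vnode operations of a toy in-memory file system (hypothetical)."""

    def read(self, vnode, offset, length):
        return bytes(vnode.v_data["content"][offset:offset + length])

    def write(self, vnode, offset, data):
        content = vnode.v_data["content"]
        if len(content) < offset:
            content.extend(bytes(offset - len(content)))  # zero-fill the gap
        content[offset:offset + len(data)] = data
        return len(data)


def vfs_read(vnode, offset, length):
    # The generic layer does not know the file system type; it dispatches
    # through the operations attached to the vnode.
    return vnode.ops.read(vnode, offset, length)


def vfs_write(vnode, offset, data):
    return vnode.ops.write(vnode, offset, data)
```

Because callers use only vfs_read and vfs_write, a different file system type can be substituted by supplying a different operations object and private structure.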
While several UNIX implementations incorporate VFS, the interfaces differ slightly between platforms. Several non-UNIX operating systems, such as Microsoft Windows NT, have interfaces similar to VFS. Installable file system interfaces such as VFS allow multiple file system types within an operating system. Each system is capable of making calls to other file systems through the virtual file system interface. For instance, an NFS server may be implemented to access a local file system through VFS. In this manner, the server software does not need to be specifically coded for the local file system type; new file systems may be added to an operating system without reconfiguring NFS.
The present invention is a shared storage distributed file system that provides users and applications with transparent access to shared data stored on network attached storage devices. The file system uses layering techniques to inherit file management functionality from existing systems. Meta-data in the present invention is stored and shared among multiple computers by storing the meta-data as real-data in regular files of a standard, non-modified, client-server distributed file system. In effect, the standard client-server file system serves as the meta-data file system (MFS) for the present invention.
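The idea of storing meta-data as real-data in regular files of a meta-data file system can be illustrated with a brief sketch. The JSON layout, the function names store_metadata and load_metadata, and the use of an ordinary directory to stand in for a mounted client-server file system are all assumptions made for illustration; they are not the format used by the present invention.

```python
import json
import os
import tempfile


def store_metadata(mfs_dir, name, extents):
    """Save a file's extent list as the contents of a regular MFS file.

    mfs_dir stands in for a directory on the meta-data file system
    (for example an NFS mount); the JSON layout is hypothetical.
    """
    path = os.path.join(mfs_dir, name)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"extents": extents}, f)
    os.replace(tmp, path)   # replace the old version in a single step


def load_metadata(mfs_dir, name):
    """Read the extent list back from the regular MFS file."""
    with open(os.path.join(mfs_dir, name)) as f:
        return json.load(f)["extents"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as mfs:
        store_metadata(mfs, "report.dat", [[0, 100, 4], [6, 250, 2]])
        print(load_metadata(mfs, "report.dat"))
```

In such an arrangement the name space, attributes, caching, and locking of the meta-data files are supplied by the underlying client-server file system, while the extents recorded in each file describe where the real-data resides on the shared storage devices.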
Real-data is stored on network attached storage devices attached to a storage area network. SFS benefits from direct network device attachment, since NAS devices off-load time-consuming data transfers from server computers. Furthermore, client computers operating under the present invention store file system meta-data on a meta-data file system. Using this meta-data, clients manage real-data stored on the network attached storage devices. The meta-data file systems also maintain the present file system name space and file attributes.
By utilizing an existing client-server system as a meta-data file system, the present invention is able to utilize the small-file access speed, consistency, caching, and file locking that are built into modern client-server file systems. Not only is development work reduced, but implementation is also simplified. Furthermore, future advances in client-server architectures can be incorporated easily and quickly.