File Systems
The term “file system” refers to the system designed to provide computer application programs with access to data stored on storage devices in a logical, coherent way. File systems hide the details of how data is stored on storage devices from application programs. For instance, storage devices are generally block addressable, in that data is addressed with the smallest granularity of one block; multiple, contiguous blocks form an extent. The size of the particular block, typically 512 bytes in length, depends upon the actual devices involved. Application programs generally request data from file systems byte by byte. Consequently, file systems are responsible for seamlessly mapping between application program address-space and storage device address-space.
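The mapping between the two address spaces can be illustrated with a minimal sketch. The function name `byte_range_to_blocks` and the fixed 512-byte block size are assumptions made for illustration only; actual block sizes depend upon the devices involved.

```python
BLOCK_SIZE = 512  # bytes; assumed for illustration, real devices vary


def byte_range_to_blocks(offset, length, block_size=BLOCK_SIZE):
    """Map a byte range of a file to the blocks that cover it.

    Returns (first_block, block_count, offset_within_first_block).
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return first_block, last_block - first_block + 1, offset % block_size
```

For example, under these assumptions a request for 100 bytes at byte offset 1000 touches two blocks, starting 488 bytes into block 1.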
File systems store volumes of data on storage devices. The term “volume” refers to the collection of data blocks for one complete file system instance. These storage devices may be partitions of single physical devices or logical collections of several physical devices. Computers may have access to multiple file system volumes stored on one or more storage devices.
File systems maintain several different types of files, including regular files and directory files. Application programs store and retrieve data from regular files as contiguous, randomly accessible segments of bytes. With a byte-addressable address-space, applications may read and write data at any byte offset within a file. Applications can grow files by writing data to the end of a file; the size of the file increases by the amount of data written. Conversely, applications can truncate files by reducing the file size to any particular length. Applications are solely responsible for organizing data stored within regular files, since file systems are not aware of the content of each regular file.
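The byte-addressable behavior described above can be observed from an application using standard Python file operations. The following is only an illustrative sketch; the function name `grow_and_truncate_demo` and the file name `example.dat` are hypothetical.

```python
import os
import tempfile


def grow_and_truncate_demo():
    """Write, extend, overwrite, and truncate a temporary regular file."""
    sizes = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.dat")
        with open(path, "wb") as f:
            f.write(b"A" * 100)        # initial contents: 100 bytes
        sizes.append(os.path.getsize(path))
        with open(path, "r+b") as f:
            f.seek(0, os.SEEK_END)     # position at the end of the file
            f.write(b"B" * 50)         # writing at the end grows the file
        sizes.append(os.path.getsize(path))
        with open(path, "r+b") as f:
            f.seek(10)                 # random access to byte offset 10
            f.write(b"C")
            f.truncate(20)             # reduce the file size to 20 bytes
        sizes.append(os.path.getsize(path))
        with open(path, "rb") as f:
            data = f.read()
    return sizes, data
```

The file system reports only sizes and bytes; the meaning of the bytes is left entirely to the application.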
Files are presented to application programs through directory files that form a tree-like hierarchy of files and subdirectories containing more files. Filenames are unique to directories but not to file system volumes. Application programs identify files by pathnames comprised of the filename and the names of all encompassing directories. The complete directory structure is called the file system namespace. For each file, file systems maintain attributes such as ownership information, access privileges, access times, and modification times.
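Pathname resolution through a directory hierarchy can be sketched as a walk over nested directory entries. In this toy model, which is an assumption and not the structure of any particular file system, directories are Python dictionaries and regular files are arbitrary placeholder values; the function name `resolve` is hypothetical.

```python
def resolve(root, pathname):
    """Walk directory entries from root and return the entry named by pathname."""
    node = root
    for name in pathname.strip("/").split("/"):
        if name == "":
            continue                      # tolerate "/" and repeated slashes
        if not isinstance(node, dict):
            raise NotADirectoryError(name)
        if name not in node:
            raise FileNotFoundError(pathname)
        node = node[name]
    return node


# The same filename may appear in two directories of one volume.
namespace = {
    "home": {
        "a": {"notes.txt": "inode 7"},
        "b": {"notes.txt": "inode 9"},
    }
}
```

Here `resolve(namespace, "/home/a/notes.txt")` and `resolve(namespace, "/home/b/notes.txt")` name different files, which illustrates that filenames are unique to directories but not to the volume.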
File systems often utilize the services of operating system memory caches known as buffer caches and page caches. These caches generally consist of system memory buffers stored in volatile, solid-state memory of the computer. Caching is a technique to speed up data requests from application programs by saving frequently accessed data in memory for quick recall by the file system without having to physically retrieve the data from the storage devices. Caching is also useful during file writes; the file system may write data to the memory cache and return control to the application before the data is actually written to non-volatile storage. Eventually, the cached data is written to the storage devices.
The state of the cache depends upon the consistency between the cache and the storage devices. A cache is “clean” when its contents are exactly the same as the data stored on the underlying storage devices. A cache is “dirty” when its data is newer than the data stored on storage devices; a cache becomes dirty when the file system has written to the cache, but the data has not yet been written to the storage devices. A cache is “stale” when its contents are older than data stored on the storage devices; a cache becomes stale when it has not been updated to reflect changes to the data stored on the storage devices.
In order to maintain consistency between the caches and the storage devices, file systems perform “flush” and “invalidate” operations on cached data. A flush operation writes dirty cached data to the storage devices before returning control to the caller. An invalidation operation removes stale data from the cache without invoking calls to the storage devices. File systems may flush or invalidate caches for specific byte-ranges of the cached files.
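The clean, dirty, and stale states, together with the flush and invalidate operations, can be modeled with a small sketch. The class `ToyCache` and its use of per-block version numbers to compare cache and device contents are assumptions made purely for illustration; actual file systems track cache state by other means.

```python
class ToyCache:
    """Toy write-back cache over a dict-based 'device' (hypothetical model)."""

    def __init__(self, device):
        self.device = device   # block number -> (version, data)
        self.cache = {}        # block number -> (version, data)

    def read(self, block):
        if block not in self.cache:
            self.cache[block] = self.device[block]   # fetch from the device
        return self.cache[block][1]

    def write(self, block, data):
        if block in self.cache:
            base = self.cache[block]
        else:
            base = self.device.get(block, (0, b""))
        self.cache[block] = (base[0] + 1, data)      # newer than the device

    def state(self, block):
        if block not in self.cache:
            return "not cached"
        cached_version = self.cache[block][0]
        device_version = self.device.get(block, (0, b""))[0]
        if cached_version == device_version:
            return "clean"
        return "dirty" if cached_version > device_version else "stale"

    def flush(self, block):
        """Write a dirty block to the device before returning."""
        if self.state(block) == "dirty":
            self.device[block] = self.cache[block]

    def invalidate(self, block):
        """Drop the cached copy without touching the device."""
        self.cache.pop(block, None)
```

In this model a block becomes stale only when something other than the cache, such as another computer, updates the device; a subsequent read after invalidation retrieves the newer data.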
Many file systems utilize data structures called inodes to store information specific to each file. Copies of these data structures are maintained in memory and within the storage devices. Inodes contain attribute information such as file type, ownership information, access permissions, access times, modification times, and file size. Inodes also contain lists of pointers that address data blocks. These pointers may address single data blocks or address an extent of several consecutive blocks. The addressed data blocks contain either actual data stored by the application programs or lists of pointers to other data blocks. With the information specified by these pointers, the contents of a file can be read or written by application programs. When application programs write to files, data blocks may be allocated by the file system. Such allocation modifies the inodes.
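A simplified inode with extent pointers might be modeled as follows. The names `Extent`, `Inode`, and `device_block_for` are hypothetical, the 512-byte block size is assumed, and the sketch omits ownership attributes, timestamps, and blocks that hold lists of pointers to other blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    file_block: int     # first logical block of the file covered by the extent
    device_block: int   # first device block of the extent
    count: int          # number of consecutive blocks


@dataclass
class Inode:
    file_type: str
    size: int = 0                                # file size in bytes
    extents: list = field(default_factory=list)

    def device_block_for(self, offset, block_size=512):
        """Return the device block holding the given byte offset, or None."""
        if not 0 <= offset < self.size:
            raise ValueError("offset is outside the file")
        logical = offset // block_size
        for extent in self.extents:
            if extent.file_block <= logical < extent.file_block + extent.count:
                return extent.device_block + (logical - extent.file_block)
        return None   # no block is allocated for this part of the file
```

With two extents of two blocks each, for instance, byte offset 1024 of the file falls in the first block of the second extent.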
Additionally, file systems maintain information, called “allocation tables”, that indicate which data blocks are assigned to files and which are available for allocation to files. File systems modify these allocation tables during file allocation and de-allocation. Most modern file systems store allocation tables within the file system volume as bitmap fields. File systems set bits to signify blocks that are presently allocated to files and clear bits to signify blocks available for future allocation.
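A bitmap allocation table of the kind described above can be sketched in a few lines. The class `BlockBitmap` and its first-fit allocation policy are assumptions for illustration; real file systems use more elaborate allocation policies.

```python
class BlockBitmap:
    """One bit per block: set means allocated, clear means available."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.bits = bytearray((total_blocks + 7) // 8)

    def is_allocated(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self):
        """Set the bit of the first available block and return its number."""
        for block in range(self.total_blocks):
            if not self.is_allocated(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise OSError("no free blocks")

    def free(self, block):
        """Clear the bit so the block is available for future allocation."""
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF
```

Freeing a block and allocating again returns the same block number under this first-fit policy.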
The terms real-data and metadata classify application program data and file system structure data, respectively. In other words, real-data is data that application programs store in regular files. Conversely, file systems create metadata to store volume layout information, such as inodes, pointer blocks, and allocation tables. Metadata is not directly visible to applications. Metadata requires a fraction of the amount of storage space that real-data occupies and has significant locality of reference. As a result, metadata caching drastically influences file system performance.
Metadata consistency is vital to file system integrity. Corruption of metadata may result in the complete destruction of the file system volume. Corruption of real-data may have serious consequences for users but will not affect the integrity of the whole volume.
I/O Interfaces
I/O interfaces transport data among computers and storage devices. Traditionally, interfaces fall into two categories: channels and networks. Computers generally communicate with storage devices via channel interfaces. Channels predictably transfer data with low-latency and high-bandwidth performance; however, channels typically span short distances and provide low connectivity. Performance requirements often dictate that hardware mechanisms control channel operations. The Small Computer System Interface (SCSI) is a common channel interface. Storage devices that are connected directly to computers are known as direct-attached storage (DAS) devices.
Computers communicate with other computers through networks. Networks are interfaces with more flexibility than channels. Software mechanisms control substantial network operations, providing networks with flexibility but large latencies and low bandwidth performance. Local area networks (LAN) connect computers over medium distances, such as within buildings, whereas wide area networks (WAN) span long distances, like across campuses or even across the world. LANs normally consist of shared media networks, like Ethernet, while WANs are often point-to-point connections, like Asynchronous Transfer Mode (ATM). Transmission Control Protocol/Internet Protocol (TCP/IP) is a popular network protocol for both LANs and WANs. Because LANs and WANs utilize very similar protocols, for the purpose of this application, the term LAN is used to include both LAN and WAN interfaces.
Recent interface trends combine channel and network technologies into single interfaces capable of supporting multiple protocols. For instance, Fibre Channel (FC) is a serial interface that supports network protocols like TCP/IP as well as channel protocols such as SCSI-3. Other technologies, such as iSCSI, map the SCSI storage protocol onto TCP/IP network protocols, thus utilizing LAN infrastructures for storage transfers.
The term “storage area network (SAN)” is used to describe network interfaces that support storage protocols. Storage devices connected to SANs are referred to as SAN-attached storage devices. These storage devices are block and object-addressable and may be dedicated devices or general purpose computers serving block and object-level data.
Block and object-addressable devices connect to SANs and share storage among multiple computers. Block-addressable devices are common storage devices that are addressable by fixed length data blocks or sectors. In contrast, object-addressable devices are impending devices that are addressable by an object identifier and an offset into the object. Each object-addressable device may support numerous objects. Two proposed object-addressable devices are the Seagate Object Oriented Device (OOD) and the Carnegie Mellon University Network Attached Secure Disks (NASD).
SANs are often dedicated networks specifically designed to transport block data; however, SANs may also operate as subsets of general purpose LANs and share the same physical network connections. Therefore, the type of data moving on the network dictates whether a network is a SAN or a LAN.
Local File Systems
Local file systems service file-level requests for application programs running only on the same computer that maintains the non-shared file system volume. To achieve the highest levels of performance, local file systems extensively cache metadata and real-data within operating system buffer caches and page caches. Because local file systems do not share data among multiple computer systems, performance is generally very good.
Local file systems traditionally store volumes on DAS devices connected directly to the computer. The weakness of using DAS is that should the computer fail, volumes located on the DAS devices become inaccessible. To reclaim access to these volumes, the DAS devices must be physically detached from the original computer and connected to a backup computer.
SAN technologies enable local file system volumes to be stored on SAN-attached devices. These volumes are accessible to several computers; however, at any point in time, each volume is only assigned to one computer. Storing local file system volumes on SAN-attached devices rather than DAS devices has the benefit that the volumes may be easily reassigned to other computers in the event of failures or maintenance.
Distributed File Systems
Distributed file systems provide users and application programs with transparent access to files from multiple computers networked together. Distributed file systems lack the high performance found in local file systems due to resource sharing and lack of data locality. However, the sharing capabilities of distributed file systems often compensate for poor performance.
Architectures for distributed file systems fall into two main categories: network attached storage (NAS)-based and storage area network (SAN)-based. NAS-based file sharing, also known as “shared nothing”, places server computers between storage devices and client computers connected via LANs. In contrast, SAN-based file sharing, traditionally known as “shared disk” or “shared storage”, uses SANs to directly transfer data between storage devices and networked computers.
NAS-Based Distributed File Systems
NAS-based distributed file systems transfer data between server computers and client computers across LAN connections. The server computers store volumes in units of blocks on DAS devices and present this data to client computers in a file-level format. These NAS servers communicate with NAS clients via NAS protocols. Both read and write data-paths traverse from the clients, across the LAN, to the NAS servers. In turn, the servers read from and write to the DAS devices. NAS servers may be dedicated appliances or general-purpose computers.
The Sun Microsystems Network File System (NFS) is a popular NAS protocol that uses central servers and DAS devices to store real-data and metadata for the file system volume. These central servers locally maintain metadata and transport only real-data to clients. The central server design is simple yet efficient, since all metadata remains local to the server. Like local file systems, central servers only need to manage metadata consistency between main memory and DAS devices. In fact, central server distributed file systems often use local file systems to manage and store data for the file system. In this regard, the only job of the central server file system is to transport real-data between clients and servers.
Central server designs were the first NAS-based distributed file systems. As the need for greater parallelism and enhanced availability grew, distributed file system designs evolved from central servers to multiple server configurations. As with central servers, multiple servers, also known as distributed servers, store all file system data on DAS devices connected to server computers. Since multiple servers cooperatively manage the file system, servers may share metadata between computers. The complexity of these designs increases by an order of magnitude, since distributed system integrity requires strong metadata consistency between servers. Such systems often cannot use local file systems to store data. As a result, server software must manage, store, and transport metadata and real-data between servers. Two examples of distributed server file systems are the Andrew File System (AFS) from Carnegie Mellon University and the Sprite File System from the University of California at Berkeley.
Distributed server file systems have further evolved into designs where clients and servers are often difficult to distinguish. In these systems, clients manage, store, and transport metadata and real-data between servers and other clients. Coda from Carnegie Mellon University and the xFS File System from the University of California at Berkeley are two examples of merged client-server designs.
One aspect of NAS-based file system designs that has remained unchanged among central server, distributed server, and merged client-server designs is the direct attachment of storage devices to computers. With devices directly attached to computers, however, a single computer failure renders data stored on the storage devices inaccessible. Although redundant devices on separate computers can be added to improve availability, such techniques add complexity and cost to the system.
Furthermore, the NAS architecture limits performance when clients access data stored on remote devices, because the data-path between client and storage device includes server computers. These servers add overheads caused by server workloads as well as overheads relating to the translations from channel interface protocols to network interface protocols. Server computers designed to support large workloads are very expensive.
FIG. 1 illustrates the data-paths and components of a typical, prior art NAS-based file sharing environment 100. NAS clients 102 are connected to the NAS server 106 via network-based I/O interface links 110 connected to the LAN 104. The LAN 104 consists of network components such as routers, switches, and hubs. The NAS server 106 connects to DAS devices 108 via channel-based I/O interface links 112. The DAS devices 108 are block addressable, non-volatile storage devices. These interface links 110 and 112 include one or more physical connections.
The NAS read data-path 114 begins at the DAS devices 108 and leads to the NAS server 106. The read data-path 114 continues through the NAS server 106, across the LAN 104, to the NAS clients 102. Conversely, the NAS write data-path 116 begins at the NAS clients 102 and traverses through the LAN 104 to the NAS server 106. The NAS server 106, in turn, writes across the channel interface link 112 to the DAS devices 108.
SAN-Based Distributed File Systems
Distributed file system designs that use SAN technologies have followed a different evolutionary path. Instead of storing data on storage devices connected directly to computers, SAN-based designs store data on SAN-attached devices shared among several client computers. SAN-based designs have high-bandwidth, low-latency data-paths between clients and devices.
SAN-based file systems require arbitration for the storage devices and consistency management of any data cached on the clients. Consistency mechanisms are either centrally located or distributed within the system. The consistency mechanisms may include software running on computers, hardware mechanisms attached to the networks, or a combination of both hardware and software.
There are two distinct SAN-based file system designs. The first design uses private file managers, in which client computers independently access metadata and real-data directly from the storage devices. Private file manager schemes do not require dedicated servers, since all necessary data is taken directly from the SAN-attached devices. With private file manager designs, clients only service local file requests. Examples of such systems include the Cray Research Shared File System, the Digital Equipment Corporation VAXcluster™, and the Global File System from the University of Minnesota.
As a result of their designs, clients utilizing private file managers remain independent from the failures and bottlenecks of other clients. Similarly, client resources such as memory, CPUs, and bus bandwidth are not spent servicing requests from other clients. However, private file manager designs have several disadvantages. First, the designs can only support a primitive form of caching. Clients may only access data cached locally in memory or data stored on the SAN-attached devices; data cached in the memory of other clients is not accessible. The second disadvantage deals with complications encountered during failure recovery. Since clients are not aware of other clients, clients must indirectly determine data corruption caused by other client failures.
The second type of SAN-based distributed file system design utilizes file manager server computers. These file servers manage file system namespace and metadata. SAN clients make requests to the SAN servers, and the servers determine the location of real-data on SAN devices by examining and modifying file metadata. Once the location is determined, the servers either initiate transfers between clients and storage devices or inform the clients how to invoke the transfers. Servers must maintain and store metadata, manage real-data, and control transfers between clients and storage devices. These SAN-based file server designs suffer from many of the same difficulties as NAS architectures. The server design is complex, since servers need to provide a great deal of functionality. Servers that fail or become overworked tend to disrupt file system operation. The SANergy file system from Tivoli Systems, the CentraVision File System (CVFS) from Advanced Digital Information Corporation (ADIC), and the Celerra HighRoad multiplex file system (MPFS) from EMC Corporation are examples of SAN-based file systems that utilize SAN server file managers to facilitate file transfers between SAN devices and SAN clients.
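The variant in which the server informs clients how to invoke transfers can be sketched as follows. All class and method names (`SanDevice`, `MetadataServer`, `SanClient`, `locate`, `read_file`) are hypothetical, and the sketch ignores consistency management, write allocation, and failure handling.

```python
class SanDevice:
    """Toy block-addressable SAN-attached device."""

    def __init__(self, blocks):
        self.blocks = blocks              # block number -> bytes

    def read_block(self, number):
        return self.blocks[number]


class MetadataServer:
    """Toy SAN file manager server that holds only file metadata."""

    def __init__(self, file_map):
        self.file_map = file_map          # filename -> list of block numbers

    def locate(self, filename):
        return list(self.file_map[filename])


class SanClient:
    def __init__(self, server, device):
        self.server = server
        self.device = device

    def read_file(self, filename):
        # Control path: ask the server where the real-data resides.
        block_numbers = self.server.locate(filename)
        # Data path: transfer real-data directly from the SAN device.
        return b"".join(self.device.read_block(n) for n in block_numbers)
```

In this sketch the real-data never passes through the server; only the block locations do.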
FIG. 2 illustrates the data-paths and components of a typical, prior art SAN-based file sharing environment 120. SAN clients 122 are connected to the SAN server 124 via network-based I/O interface links 110 connected to the LAN 104. The LAN 104 consists of network components such as routers, switches, and hubs. Typically only control and consistency information passes across the LAN 104. In some SAN-based file system designs, the SAN server 124 and the LAN 104 are unnecessary. In other designs, the SAN-based file system may actually utilize the services of a NAS-based file system to pass control information between the servers 124 and clients 122. Regardless of the control data-path, SAN clients 122 access all real-data via SAN protocols.
The SAN clients 122 and the SAN server 124 connect to the SAN-attached devices 126 via channel-based I/O interface links 130 capable of transferring storage protocols over network connections. As with the LAN links 110, the channel links 130 include one or more physical connections. The I/O interface links 130 connect to the SAN 128, which consists of network components such as routers, switches, and hubs. The SAN 128 may also include components that perform storage virtualization, caching, and advanced storage management functions. The SAN-attached devices 126 are typically block addressable, non-volatile storage devices. SAN-attached devices 126 may also support object-addressable interfaces. SAN-attached devices 126 often have multiple ports that connect via channel links 130 to the SAN 128.
The SAN read data-path 132 begins at the SAN devices 126, passes across the SAN 128, and leads to the SAN clients 122 and the SAN server 124. The SAN write data-path 134 begins at the SAN clients 122 and the SAN server 124 and passes through the SAN 128 to the SAN-attached devices 126.
SAN-Based File Sharing using Local File Systems
Local file systems may be used in SAN file sharing environments 120 under various restrictions. For instance, most local file system volumes may be mounted by multiple SAN clients 122 as long as all clients 122 mount the volume in read-only mode. Since the volume does not change, caching performed by the clients 122 does not affect the state of the SAN environment 120. When files of the volume need to be modified, however, all clients 122 must unmount the volume and then one client 122 re-mounts the volume in read-write mode. This client 122 makes the appropriate modifications and then unmounts the volume. Finally, all clients 122 re-mount the volume in read-only mode. This scheme promotes high-speed file sharing yet is tremendously restrictive and inefficient with respect to modifying volumes.
Some local file systems are specifically designed to support SAN file sharing environments 120 where one SAN client 122 mounts the volume in read-write mode and all other SAN clients 122 mount the volume read-only. These SAN-based local file systems must frequently flush dirty caches on the read-write client 122 and regularly invalidate stale caches on the read-only clients 122. Given that only one computer is capable of modifying the volumes, this solution lacks the transparency required by most applications and thus possesses limited usefulness.
SAN Clients that Serve NAS Clients
A SAN-based file sharing environment 120 may be configured to serve a large number of NAS client computers 102 using NAS file sharing protocols. SAN clients 122 act as NAS servers 106 that serve volumes stored on the SAN-attached devices 126 to a large number of NAS clients 102 connected to the NAS servers 106 through LANs 104. Such systems, also known as clusters, combine SAN and NAS technologies into a two tiered scheme. In effect, a NAS cluster can be viewed as a single, large NAS server 106.
SAN Appliances
SAN appliances are prior art systems that consist of a variety of components including storage devices, file servers, and network connections. SAN appliances provide block-level, and possibly file-level, access to data stored and managed by the appliance. Despite the ability to serve both block-level and file-level data, SAN appliances do not possess the needed management mechanisms to actually share data between the SAN and NAS connections. The storage devices are usually partitioned so that a portion of the available storage is available to the SAN 128 and a different portion is available for NAS file sharing. Therefore, for the purpose of this application, SAN appliances are treated as the subsystems they represent.
FIG. 3 illustrates an example of a SAN appliance 136 that possesses an internal SAN 138 that shares data between SAN-attached devices 126, the NAS server 124, and the SAN 128 external to the appliance 136. The appliance 136 serves block-level data, through channel-based interface links 130, to the SAN 128. From the perspective of the SAN, the appliance 136 appears as a prior art SAN-attached device 126. The appliance 136 also serves file-level data, through network-based interface links 110, to the LAN 104. From the perspective of the LAN, the appliance 136 appears as a prior art NAS server 124.
Another adaptation of a SAN appliance is simply a general purpose computer with DAS devices. This computer converts the DAS protocols into SAN protocols in order to serve block-level data to the SAN 128. The computer may also act as a NAS server 124 and serve file-level data to the LAN 104.
File System Layering
File system designers can construct complete file systems by layering, or stacking, partial designs on top of existing file systems. The new designs reuse existing services by inheriting functionality of the lower level file system software. For instance, NFS is a central-server architecture that utilizes existing local file systems to store and retrieve data from storage devices attached directly to servers. By layering NFS on top of local file systems, NFS software is free from the complexities of namespace, file attribute, and storage management. NFS software consists of simple caching and transport functions. As a result, NFS benefits from performance and recovery improvements made to local file systems.
Other examples of file system layering include adding quota support to existing file systems, strengthening consistency of cached data in an existing distributed file system, and adding compression or encryption to file systems without such support.
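Layering quota support on top of an existing file system might look like the following sketch. The classes `LowerFs` and `QuotaLayer`, and the single volume-wide byte limit, are assumptions chosen for brevity; they do not describe any particular quota implementation.

```python
class LowerFs:
    """Toy lower-level file system that stores file contents in memory."""

    def __init__(self):
        self.files = {}                   # filename -> bytes

    def write(self, name, data):
        self.files[name] = self.files.get(name, b"") + data   # append
        return len(data)

    def read(self, name):
        return self.files.get(name, b"")


class QuotaLayer:
    """Upper layer that adds a byte quota and delegates storage downward."""

    def __init__(self, lower, limit_bytes):
        self.lower = lower
        self.limit_bytes = limit_bytes
        self.used = 0

    def write(self, name, data):
        if self.used + len(data) > self.limit_bytes:
            raise OSError("quota exceeded")
        written = self.lower.write(name, data)   # reuse the lower layer
        self.used += written
        return written

    def read(self, name):
        return self.lower.read(name)             # passes straight through
```

The upper layer contains only the quota logic; storage management remains the responsibility of the lower file system.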
Installable File System Interfaces
Most modern operating systems include installable file system interfaces to support multiple file system types within a single computer. In UNIX, the Virtual File System (VFS) interface is an object-oriented, installable interface. While several UNIX implementations incorporate VFS, the interfaces differ slightly between platforms. Several non-UNIX operating systems, such as Microsoft Windows NT, have interfaces similar to VFS.
VFS occupies the level between the system call interface and installed file systems. Each installed file system provides the UNIX kernel with functions associated with VFS and vnode operations. VFS functions operate on whole file systems to perform tasks such as mounting, unmounting, and reading file system statistics. Vnode operations manipulate individual files. Vnode operations include opening, closing, looking up, creating, removing, reading, writing, and renaming files.
Vnode structures are the objects upon which vnode functions operate. The VFS interface creates and passes vnodes to file system vnode functions. A vnode is the VFS virtual equivalent of an inode. Each vnode maintains a pointer called “v_data” to attached file system specific, in-core memory structures such as inodes.
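The relationship between vnodes, file system specific operations, and the “v_data” pointer can be illustrated with a small model. This is a Python analogy under stated assumptions, not kernel code: the names `Vnode`, `ToyMemFsOps`, `vn_read`, and `vn_write` are hypothetical, and the file system specific structure is reduced to a dictionary holding a byte buffer.

```python
class Vnode:
    def __init__(self, ops, v_data):
        self.ops = ops         # operations supplied by the installed file system
        self.v_data = v_data   # file system specific in-core structure


class ToyMemFsOps:
    """Hypothetical file system whose per-file structure is a dict."""

    def read(self, vnode, offset, length):
        return bytes(vnode.v_data["data"][offset:offset + length])

    def write(self, vnode, offset, data):
        buf = vnode.v_data["data"]
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))   # zero-fill the gap
        buf[offset:offset + len(data)] = data
        return len(data)


def vn_read(vnode, offset, length):
    """Generic layer: dispatch to whichever file system owns the vnode."""
    return vnode.ops.read(vnode, offset, length)


def vn_write(vnode, offset, data):
    return vnode.ops.write(vnode, offset, data)
```

The generic functions never inspect `v_data`; only the installed file system's operations interpret it, which is what allows several file system types to coexist behind one interface.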
Many file system interfaces support layering. With layering, file systems are capable of making calls to other file systems through the virtual file system interface. For instance, NFS server software may be implemented to access local file systems through VFS. In this manner, the server software does not need to be specifically coded for any particular local file system type; new local file systems may be added to an operating system without reconfiguring NFS.