Generally speaking, in the computer arts, a file system is the mechanism that defines the storage, hierarchical organization, access and operations for retrieval and manipulation of data on a storage device or medium (e.g., a hard disk or a CD-ROM). Computer systems using UNIX or Linux operating systems require a file system called the “root filesystem”. The root filesystem includes programs and information to start up and manage the subject computer system. The root filesystem is formed of a hierarchical tree of directories 91 and files 89 where the root 93 of this tree is referred to as “root” and denoted by “/”. FIG. 1 shows a typical UNIX root filesystem 87.
In UNIX and Linux terminology there are several file types, namely, “regular”, “directory”, “symbolic link”, “block and character oriented device file” and “special”. A “regular” file, commonly called simply a “file”, is a named container of information stored in a directory. A “directory” file is a container of directory entries, one for each file in the subject directory. Each directory entry includes a name and a set of attributes that point to a file (for example, its inode, discussed below). A “symbolic link” is a file containing a pointer to a different file, which can then be accessed through the symbolic link. This file type serves to overcome the limitations of hard links, which are discussed later. A “block or character oriented device file” is not a container at all but rather an interface to an I/O device that can be accessed directly through the file. “Special” files (pipes and sockets) are files used for interprocess communication and networking.
As mentioned above, directories are files that contain directory entries, each of which points to an inode. As a consequence, one file on a disk may be pointed to by several different directory entries; such a file is said to have multiple hard links. In order to maintain multiple hard links, each inode carries a link counter. The link counter is incremented each time a hard link is created and decremented each time a hard link is deleted. Only when the link counter reaches zero are the actual data blocks on disk released.
Hard links are subject to some restrictions. First, users are not allowed to create hard links to directories. Second, users are not allowed to create hard links that cross filesystems. This is because the directory entry uses the inode number as a pointer to the inode, and inode numbers are unique only within one filesystem.
In many cases, access to a certain filesystem located on a device other than the device containing the root filesystem may be required. This separate location may be a different partition on the same hard drive, a CD in a CD-ROM drive or a network filesystem. In order to access such a filesystem, the operating system must be instructed to attach the filesystem so that it appears under a directory of the root filesystem. That directory serves as the so-called mount point 95. This process is called mounting a filesystem. FIG. 2 shows the root filesystem 93 from FIG. 1 after a CD-ROM (separate filesystem 77) has been mounted at the mount point labeled /mnt/cdrom 95. As long as the CD-ROM 77 is mounted, the information stored on it will be available there and accessible through /mnt/cdrom.
A “virtual filesystem”, also called VFS or Virtual Filesystem Switch is a software layer handling the generic tasks involved in the use of a filesystem. These generic tasks include error checking, caching VFS objects, doing the required locking and communicating with user space. Once the generic tasks are completed, the VFS passes the operation on to the specific layer, where it may be completed or again passed on to a next lower layer. FIG. 3 shows the relation between the generic VFS layer 71 and a specific filesystem layer 73.
To achieve this strict separation, the so-called “common file model” was introduced. This object oriented model defines operations and objects which are common to all file systems. The following objects are defined in the common file model:
    (1) The superblock object: this object holds all data relevant to a mounted filesystem 77. Disk-based filesystems usually store a copy of this object on disk.
    (2) The inode object: this object stores information on a file located within a filesystem, such as its size, permissions or access time. Most importantly, it holds the inode number, which is used to uniquely identify a file within a filesystem. Disk-based filesystems usually store a copy of this object.
    (3) The file object: this object is used to store information regarding an open file. It holds information on the file and the processes using it. This object exists only in memory.
    (4) The dentry object: this object represents a link from a directory entry to an inode. This is necessary because one file can be represented by several hard links. The VFS 71 caches the dentry objects in order to speed up lookup operations.
Each of these objects has a set of operations associated with it which can be defined by a specific filesystem 73. These operations are implemented as a set of function pointers which are used by the VFS 71 to call the appropriate function of the specific filesystem 73. Thus, the separation of generic and specific layers shown in FIG. 3 is achieved, where the VFS 71 is able to call the appropriate function of a specific filesystem 73 without having to know any specific details of how the data is stored in that filesystem 73.
As can be seen from the above, filesystems are mainly implemented in kernel space, where debugging is difficult and small bugs may crash a whole computer system. Filesystems also tend to consist of large amounts of code that, due to the nature of operating system kernels, is fairly complex. Therefore, changing and extending existing functionality is difficult.
Stackable or stacking filesystems provide a solution to some of these problems. Instead of modifying an existing filesystem, a new filesystem layer is stacked on top of the existing one. This new layer adds the required functionality in a modular way and therefore can be developed, tested and maintained separately. FIG. 4 shows the relationship between the VFS 71, a stackable filesystem 75 and the lower (specific) filesystem 73.
The VFS layer 71 that handles the generic actions involved in a filesystem 73 operation remains at the top of this hierarchy. When these actions have been completed, the subject function is called for the next lower level. The VFS 71 as a generic layer does not need to know what type of filesystem the lower layer is and will simply call the appropriate function via the VFS-object function pointers. This enables a new layer to be introduced, the new layer being the stackable filesystem layer 75. The stackable filesystem 75 is mounted on top of a mounted filesystem and accessed through the new mount point.
Further, the new stackable filesystem layer 75 has to behave differently from traditional filesystems. As lower layers 73 do not have any knowledge of their caller, the stackable layer 75 needs to present itself in two different manners: to the lower level as the VFS, and to the higher level as a traditional filesystem. When the VFS 71 calls a function of the stackable layer 75, generic tasks are first performed, as was previously done by the VFS 71, but now on behalf of the stackable filesystem 75. Next the stackable filesystem layer 75 invokes zero or more functions (as object methods) on the layer 73 beneath it. When the call returns to the stackable filesystem layer 75, the result is handled accordingly and then returned to the VFS 71.
Thus, many applications and services employ stacking filesystems because doing so allows the application or service to intercept filesystem operations and perform its own processing without any modification of the logic of the caller of the filesystem services. However, there are some particular challenges involved in implementing stacking filesystems for Linux systems. Linux kernels do not have a UNIX System V release 4 (SVR4)-style vnode interface. Instead, Linux kernels have a different interface spread among several distinct object types and operation vectors. Examples include: inodes (struct inode and inode_operations), directory cache entries (dentries or dcache entries, struct dentry), open-file objects (struct file and file_operations), address space objects (address_space_operations) and super blocks (struct super_block and super_operations). Many operations map straightforwardly between the various Linux kernel operations and either SVR4-style vnode object methods (e.g., VOP_*( ) calls) or file system object methods (e.g., VFS_*( ) calls). The biggest difference that affects stacking filesystems is the way pathnames and lookups are handled.
To create a per-process private state, a stacking file system can establish a per-process root context for file operations. On an SVR4 system this can be done by issuing a hidden chroot to a special vnode in the stacking file system. This special vnode then serves as the starting point for pathnames beginning with “/”. VOP_LOOKUP( ) invoked on this vnode may return vnode objects from the original root file system, e.g. looking up “etc” returns the original root file system's vnode representing “/etc”. As lookups continue down the name tree and encounter another mounted file system instance from the stacking file system, the context can be found from the process's root directory vnode. The stacking file system then returns results consistent with its design.
A similar technique can be used with a Linux kernel but because of the difference in how pathnames and lookups are handled, various problems are encountered, particularly at the root of the file name space.
The Linux kernel requires file systems to use the directory name cache (dentry) structures instead of vnodes for many of its file operations. Most of the Linux kernel's namespace-related file operations operate on dentries. Many other kernel operations use dentries to hold references to files (most notably the current directory and process root directory). Each separately-mounted file system has a separate dentry tree. Within each mounted file system, these structures are linked in a hierarchical tree structure reflecting the namespace of the file system. The root of the tree is the root of the file system's name space. When a mount-point is encountered, the mounted-over dentry and the mounted-over file system structure (vfsmnt) are used as hash keys to find the new file system's root dentry and file system structure.
If the previous method from SVR4 is adapted to Linux, several of the namespace-modification operations fail in the root directory (e.g., link, unlink, rename). These operations require that the parent directory and object dentries be parent and child in the dentry cache hierarchy. If one uses the same method as above, the dentry for “/” will not be the parent of the dentry for “/etc”, because the “/etc” dentry is from the real file system while the dentry for “/” is from the stacking file system.
To address this failure, a Linux-specific method has been tried, but it too fails to work properly. This method shadows dentries for all files in the root file system. In this method, lookups of “etc” in “/” result in a dentry from the stacking file system. The stacking file system must simulate the behavior of the real “/etc” object by passing on any requests of this virtual dentry to the real file system's dentry. However, certain requests cannot be passed on correctly, such as mount point crossings: the dentries provided by the stacking file system do not match the real ones, so the hash table lookups for dentries when crossing mount points do not work. The stacking file system thus must shadow all dentries in all mounted file systems. This causes a myriad of other problems, such as incorrect results from disk-space statistics queries (e.g., df -k) and improper behavior for socket files.