A file server is a computer that provides file service relating to the organization of information on writeable persistent storage devices, such as memories, tapes or disks. The file server or filer may be embodied as a storage system including a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the file system on the filer by issuing file system protocol messages (in the form of packets) to the filer over the network.
A common type of file system is a “write in-place” file system, wherein the locations of the data structures, such as inodes and data blocks, on disk are typically fixed. An inode is a data structure used to store information, such as meta-data, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permission for the file, size of the file, file type and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers, which may further reference indirect blocks that, in turn, reference the data blocks, depending upon the quantity of data in the file. Changes to the inodes and data blocks are made “in-place” in accordance with the write in-place file system. If an update to a file extends the quantity of data for the file, an additional data block is allocated and the appropriate inode is updated to reference that data block.
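The inode and pointer arrangement described above can be sketched as follows. This is a minimal illustration only, not the layout of any particular file system: the block size, the number of direct pointers, the single level of indirection and all names (`Inode`, `Disk`, `block_address`, `append_block`, `overwrite_in_place`) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

BLOCK_SIZE = 4096   # assumed block size in bytes
NUM_DIRECT = 4      # assumed number of direct pointers per inode


@dataclass
class Inode:
    """Meta-data about one file, plus references to its data blocks."""
    owner: str
    permissions: int
    file_type: str
    size: int = 0
    direct: List[Optional[int]] = field(
        default_factory=lambda: [None] * NUM_DIRECT)
    indirect: Optional[int] = None  # address of a block holding more pointers


class Disk:
    """Toy disk: maps a block number to data bytes or to a list of pointers."""

    def __init__(self) -> None:
        self.blocks: Dict[int, object] = {}
        self.next_free = 0

    def allocate(self, content: object) -> int:
        addr = self.next_free
        self.next_free += 1
        self.blocks[addr] = content
        return addr


def block_address(disk: Disk, inode: Inode, index: int) -> Optional[int]:
    """Resolve logical block `index` of a file to a disk block number."""
    if index < NUM_DIRECT:
        return inode.direct[index]
    if inode.indirect is None:
        return None
    pointers = disk.blocks[inode.indirect]
    offset = index - NUM_DIRECT
    return pointers[offset] if offset < len(pointers) else None


def append_block(disk: Disk, inode: Inode, data: bytes) -> int:
    """Extend the file by one data block and update the inode to reference it."""
    index = (inode.size + BLOCK_SIZE - 1) // BLOCK_SIZE
    addr = disk.allocate(data)
    if index < NUM_DIRECT:
        inode.direct[index] = addr
    else:
        if inode.indirect is None:
            inode.indirect = disk.allocate([])  # new, empty indirect block
        disk.blocks[inode.indirect].append(addr)
    inode.size += len(data)
    return addr


def overwrite_in_place(disk: Disk, inode: Inode, index: int, data: bytes) -> int:
    """Write-in-place update: the block keeps its fixed location on disk."""
    addr = block_address(disk, inode, index)
    disk.blocks[addr] = data
    return addr
```

In this sketch a small file is reached entirely through the direct pointers, while a larger file spills into the indirect block, mirroring the dependence on the quantity of data in the file.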
Another type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks.
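By way of contrast with the write in-place approach, the write-anywhere behavior may be sketched as shown below. The class `WriteAnywhereStore` and its methods are hypothetical names chosen for the illustration, and the allocator simply hands out the next unused block number; an actual write-anywhere file system uses considerably more elaborate allocation and layout policies.

```python
from typing import Dict, Tuple


class WriteAnywhereStore:
    """Toy store in which a modified block is never overwritten on disk."""

    def __init__(self) -> None:
        self.disk: Dict[int, bytes] = {}                  # block number -> data
        self.next_free = 0                                # assumed trivial allocator
        self.block_map: Dict[Tuple[str, int], int] = {}   # (file, index) -> block

    def _allocate(self, data: bytes) -> int:
        addr = self.next_free
        self.next_free += 1
        self.disk[addr] = data
        return addr

    def write(self, name: str, index: int, data: bytes) -> int:
        """Store the dirtied block at a new location and repoint the file."""
        addr = self._allocate(data)
        # The previous block, if any, is left untouched in this sketch.
        self.block_map[(name, index)] = addr
        return addr

    def read(self, name: str, index: int) -> bytes:
        return self.disk[self.block_map[(name, index)]]
```

Writing the same logical block twice therefore yields two different disk locations, with the file referencing only the most recent one.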
Both the write-anywhere file system and the write in-place file system may be implemented on a file server configured to generate a persistent image of its active file system at a particular point in time for, e.g., storage on disk. The disk storage may be implemented as one or more storage “volumes” that comprise a cluster of physical storage devices (disks) defining an overall logical arrangement of disk space. Each volume is generally associated with its own file system. The persistent image of the active file system is useful in that it may be used in many applications, including asynchronous mirroring or other automated file system replication facilities.
Assume a client stores files organized as its home directory on a file server configured to generate a persistent image of its active file system. Some time after storing those files, the server generates the persistent image of its active file system. Assume the client thereafter performs a backup operation to (again) store all of the files in its home directory on the server. Since the persistent image was generated before the files were overwritten, the file system's view of the files still holds the old contents at the time of the backup operation. Because those old contents also belong to the persistent image, the old data blocks cannot be overwritten; writing the new data instead results in allocating new blocks and writing the new data to the new blocks.
Accordingly, the majority of data in the home directory is identical to that stored during the previous backup operation and the effective behavior is that the “old” contents of each file are replaced with “new” contents that happen to be the same as the previous contents. Unfortunately, if those previous file contents already reside in the persistent disk image, the file system may be “unaware” of the fact that the new contents are identical and therefore writes the new data to new locations on the disk. This results in two copies of the same data for the same file and, hence, duplication and inefficient disk storage of the file data.
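The duplication scenario just described may be reproduced with a small model. The sketch below is an assumption-laden illustration: the `Volume` class, its dictionary-based persistent image and its `blocks_in_use` accounting are invented for the example and do not represent how any actual server stores its images.

```python
from typing import Dict, List, Set


class Volume:
    """Toy volume with an active file system and persistent images of it."""

    def __init__(self) -> None:
        self.disk: Dict[int, bytes] = {}
        self.next_free = 0
        self.active: Dict[str, List[int]] = {}       # file -> block numbers
        self.images: List[Dict[str, List[int]]] = []  # frozen copies of `active`

    def write_file(self, name: str, blocks: List[bytes]) -> None:
        """Store every block at a newly allocated location."""
        addrs = []
        for data in blocks:
            addr = self.next_free
            self.next_free += 1
            self.disk[addr] = data
            addrs.append(addr)
        self.active[name] = addrs

    def create_image(self) -> None:
        """Record which blocks each file references at this point in time."""
        self.images.append({n: list(a) for n, a in self.active.items()})

    def blocks_in_use(self) -> Set[int]:
        """Blocks referenced by the active file system or by any image."""
        used: Set[int] = set()
        for view in [self.active] + self.images:
            for addrs in view.values():
                used.update(addrs)
        return used


vol = Volume()
contents = [b"chapter 1", b"chapter 2", b"chapter 3"]
vol.write_file("home/report.txt", contents)   # original store
vol.create_image()                            # persistent image generated
vol.write_file("home/report.txt", contents)   # backup rewrites identical data
duplicated = len(vol.blocks_in_use())         # twice the size of the file
```

After the second store, the image and the active file system reference disjoint sets of blocks holding byte-for-byte identical data, which is the inefficiency at issue.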
One solution to this problem is to disable generation of persistent images of an active file system on the server volume. However, this may be undesirable because those features, benefits and applications that rely on such persistent images will no longer be available to the client. Another solution is to check the contents of the file in all persistent images at the time the file is overwritten, i.e., modified. This may not always be possible because at write-time, the client may send less than one block of data per write operation (i.e., a partial write operation). Also, there could be multiple write operations before an entire block is full. These actions can result in numerous unnecessary and inefficient block comparison operations. Furthermore, the data for a file may be written to a temporary file which is subsequently renamed to the original file name. This is a common procedure for many application programs. In this case, the server cannot check for identical data until the temporary file is renamed to the original file name.
Another solution is to reduce the number of duplicated blocks by having the inodes in both the active file system and the persistent disk image point to the same block. This can be accomplished if, for example, the contents of the disk block have not been modified since the time the persistent disk image (i.e., snapshot) was created. An example of a process to reduce such duplicated blocks is described in U.S. patent application Ser. No. 10/104,694, entitled File Folding Technique, by A. Kahn et al. In order to reduce duplicate data blocks, a file system requires some way to identify files that are likely candidates for such a file folding process. The candidate files could be “manually” provided to the file system, i.e., user initiated, but this would require an impractically high level of user interaction to maintain the file system. The present invention is directed to reducing inefficiencies associated with file servers when identifying such candidate files.
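The general idea of having the active file system and the persistent image point to the same block may be sketched as follows. This is not the process of the above-referenced application, whose details are not reproduced here; it is merely a simplified illustration in which the hypothetical function `fold_file` compares corresponding blocks and shares the pointer when their contents match.

```python
from typing import Dict, List


def fold_file(disk: Dict[int, bytes],
              active_ptrs: List[int],
              image_ptrs: List[int]) -> List[int]:
    """Repoint active blocks at identical image blocks.

    Returns the block numbers released by the operation. Blocks are
    compared position by position, which is an assumption of this sketch.
    """
    freed: List[int] = []
    for i, (active, image) in enumerate(zip(active_ptrs, image_ptrs)):
        if active != image and disk[active] == disk[image]:
            active_ptrs[i] = image   # both views now share one block
            freed.append(active)
    for addr in freed:
        disk.pop(addr, None)         # the duplicate copy is no longer needed
    return freed
```

For example, given an image referencing blocks 0 and 1 and an active file referencing blocks 2 and 3, where block 2 duplicates block 0, the function repoints the first active pointer to block 0 and releases block 2, while the modified block 3 is left alone.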
The present invention is further directed to keeping track of files that have been modified on a file server to thereby increase the efficiency of the server.
Moreover, the present invention is directed to keeping track of files that have been modified on a file server to identify potential candidates for other file system processes, such as the above-referenced file folding process.