A file server is a computer that provides file service relating to the organization of information on writeable persistent storage devices, such as memories, tapes or disks. The file server or filer may be embodied as a storage system including a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored.
A filer may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on a server, e.g., the filer. In this model, the client may comprise an application executing on a computer that “connects” to the filer over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Each client may request the services of the file system on the filer by issuing file system protocol messages (in the form of packets) to the filer over the network.
A common type of file system is a “write in-place” file system, wherein the locations of the data structures, such as inodes and data blocks, on disk are typically fixed. An inode is a data structure used to store information, such as meta-data, about a file, whereas the data blocks are structures used to store the actual data for the file. The information contained in an inode may include, e.g., ownership of the file, access permission for the file, size of the file, file type and references to locations on disk of the data blocks for the file. The references to the locations of the file data are provided by pointers, which may further reference indirect blocks that, in turn, reference the data blocks, depending upon the quantity of data in the file. Changes to the inodes and data blocks are made “in-place” in accordance with the write in-place file system. If an update to a file extends the quantity of data for the file, an additional data block is allocated and the appropriate inode is updated to reference that data block.
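The inode layout and in-place update described above can be illustrated with a minimal sketch. It is not taken from any particular file system: the names Inode, InPlaceDisk, append_block and overwrite_block, the block size, and the choice of two direct pointers plus one single-level indirect block are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BLOCK_SIZE = 4096   # assumed block size in bytes (illustrative only)
NUM_DIRECT = 2      # assumed number of direct pointers, kept small on purpose


@dataclass
class Inode:
    """Hypothetical inode: meta-data plus references to data block locations."""
    owner: str
    mode: int                                         # access permission bits
    size: int = 0                                     # file size in bytes
    direct: List[int] = field(default_factory=list)   # block numbers of data blocks
    indirect: Optional[int] = None                    # block number of an indirect block


class InPlaceDisk:
    """Toy disk that maps a block number to the contents stored there."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0

    def allocate(self, contents):
        bno = self.next_free
        self.next_free += 1
        self.blocks[bno] = contents
        return bno


def block_numbers(disk, inode):
    """Return the file's data block numbers in order, following the indirect block."""
    numbers = list(inode.direct)
    if inode.indirect is not None:
        numbers.extend(disk.blocks[inode.indirect])
    return numbers


def append_block(disk, inode, data):
    """Extend the file: allocate one more data block and update the inode."""
    bno = disk.allocate(data)
    if len(inode.direct) < NUM_DIRECT:
        inode.direct.append(bno)
    else:
        if inode.indirect is None:
            # The indirect block holds a list of further data block numbers.
            inode.indirect = disk.allocate([])
        disk.blocks[inode.indirect].append(bno)
    inode.size += len(data)
    return bno


def overwrite_block(disk, inode, index, data):
    """Write in-place: the new data replaces the old data at the same location."""
    bno = block_numbers(disk, inode)[index]
    disk.blocks[bno] = data
    return bno


disk = InPlaceDisk()
inode = Inode(owner="alice", mode=0o644)
for chunk in (b"a", b"b", b"c"):
    append_block(disk, inode, chunk)

before = block_numbers(disk, inode)
overwrite_block(disk, inode, 0, b"A")
after = block_numbers(disk, inode)
```

In this sketch the third block no longer fits in the direct pointers, so it is reached through the indirect block, and the overwrite leaves every block number unchanged, which is the defining property of the write in-place approach.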
Another type of file system is a write-anywhere file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory and “dirtied” with new data, the data block is stored (written) to a new location on disk to thereby optimize write performance. A write-anywhere file system may initially assume an optimal layout such that the data is substantially contiguously arranged on disks. The optimal disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks.
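By way of illustration only, the write-anywhere behavior described above may be sketched as follows. The names WriteAnywhereDisk, write_new and update_block are hypothetical, and the allocator simply hands out the next unused block number, which is an assumption rather than a description of any actual allocation policy.

```python
class WriteAnywhereDisk:
    """Toy disk that never overwrites a block once it has been written."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0

    def write_new(self, data):
        """Store data at a fresh location and return its block number."""
        bno = self.next_free
        self.next_free += 1
        self.blocks[bno] = data
        return bno


def update_block(disk, block_map, index, data):
    """Write dirtied data to a new location and repoint the file at it.

    Returns the old block number, whose contents remain untouched on disk.
    """
    old = block_map[index]
    block_map[index] = disk.write_new(data)
    return old


disk = WriteAnywhereDisk()
# A two-block file laid out contiguously at blocks 0 and 1.
file_blocks = [disk.write_new(b"v1-0"), disk.write_new(b"v1-1")]

# Dirty the first block in memory, then write it back.
old_location = update_block(disk, file_blocks, 0, b"v2-0")
```

After the update, the file references a new block for its first block of data while the old block still holds the previous contents, so the file's blocks are no longer contiguous in this toy layout.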
Both the write-anywhere file system and the write in-place file system may be implemented on a file server configured to generate a persistent image of its active file system at a particular point in time for, e.g., storage on disk. The disk storage may be implemented as one or more storage “volumes” that comprise a cluster of physical storage devices (disks) defining an overall logical arrangement of disk space. Each volume is generally associated with its own file system. The persistent image of the active file system is useful in that it may be used in many applications, including asynchronous mirroring or other automated file system replication facilities.
Assume a client stores files organized as its home directory on a file server configured to generate a persistent image of its active file system. Some time after those files are stored, the server generates the persistent image of its active file system. Assume the client thereafter performs a backup operation that (again) stores all of the files in its home directory on the server. Because the persistent image was generated before the files were overwritten, the file system regards the existing contents of those files as old contents to be replaced. However, the old contents are retained in the persistent image, so the old data blocks cannot simply be overwritten; instead, overwriting them with new data results in allocating new blocks and writing the new data to the new blocks.
In such a backup operation, the majority of data in the home directory is typically identical to that stored during the previous backup operation, and the effective behavior is that the “old” contents of each file are replaced with “new” contents that happen to be the same as the previous contents. Unfortunately, if those previous file contents already reside in the persistent disk image, the file system may be “unaware” that the new contents are identical and may therefore write the new data to new locations on the disk. This results in two copies of the same data for the same file and, hence, duplication and inefficient disk storage of the file data.
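The duplication described above can be reproduced with a small sketch. It assumes a toy volume in which a persistent image is simply a frozen copy of the active file system's block references; the names Volume, write_file, snapshot and referenced_blocks, as well as the file name used, are hypothetical.

```python
class Volume:
    """Toy volume with an active file system and persistent images of it."""

    def __init__(self):
        self.blocks = {}      # block number -> contents
        self.next_free = 0
        self.active = {}      # file name -> list of block numbers
        self.snapshots = []   # each entry is a frozen copy of self.active

    def write_file(self, name, chunks):
        """Store a file; every chunk goes to a newly allocated block."""
        numbers = []
        for chunk in chunks:
            bno = self.next_free
            self.next_free += 1
            self.blocks[bno] = chunk
            numbers.append(bno)
        self.active[name] = numbers

    def snapshot(self):
        """Generate a persistent image of the active file system."""
        self.snapshots.append({n: list(b) for n, b in self.active.items()})

    def referenced_blocks(self):
        """Blocks still in use by the active file system or any image."""
        refs = set()
        for image in [self.active] + self.snapshots:
            for numbers in image.values():
                refs.update(numbers)
        return refs


vol = Volume()
contents = [b"xxxx", b"yyyy"]

vol.write_file("home/notes.txt", contents)   # first backup
vol.snapshot()                               # persistent image retains blocks 0 and 1
vol.write_file("home/notes.txt", contents)   # second backup, identical data

in_use = vol.referenced_blocks()
distinct_contents = {vol.blocks[b] for b in in_use}
duplicates = len(in_use) - len(distinct_contents)
```

Here four blocks remain in use although only two distinct block contents exist, because the image pins the original blocks while the second backup allocates new ones for the same data.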
One solution to this problem is to disable generation of persistent images of an active file system on the server volume. However, this may be undesirable because those features, benefits and applications that rely on such persistent images will no longer be available to the client. Another solution is to check the contents of the file in all persistent images at the time the file is overwritten. This may not always be possible because at write-time, the client may send less than one block of data per write operation (i.e., a partial write operation). Also, there could be multiple write operations before an entire block is full. These actions can result in numerous unnecessary and inefficient block comparison operations. The present invention is directed to solving the inefficiencies associated with file servers configured to generate persistent images of their active file systems.