In a data storage system, it is desirable to use storage space as efficiently as possible. One type of system in which this concern is particularly important is a storage server, such as a file server. File servers and other types of storage servers are often used to maintain extremely large quantities of data. In such systems, efficiency of storage space utilization is critical.
Files maintained by a file server generally are made up of individual blocks of data. A common block size is four kilobytes. In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files have some data in common, or when a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space.
A technique which has been used to address this problem in the prior art is referred to as “file folding”. The basic principle of file folding is to allow new data of a file in the active file system to share a disk block with the old data of the file in a persistent image if the new data are identical to the old data. By using file folding, ideally only one occurrence of each unique data block will exist in a file system. This technique has been implemented in file servers, known as Filers, made by Network Appliance, Inc., of Sunnyvale, Calif. Specifically, Network Appliance Filers are capable of acquiring a Snapshot™ of a specified set of data. A “Snapshot” is a persistent, read-only image of the storage system, and more particularly, of the active file system, at a particular instant in time. If a block within a file that has been “Snapshotted” is modified after the Snapshot, rather than creating another complete (modified) copy of the file in the active file system, the Filer only creates the modified block for that file in the active file system; for each unmodified block, the Filer simply gives the file a pointer to the corresponding block in the Snapshot. In this way, the unmodified blocks in the Snapshot become shared between the Snapshot and the active file system. This technique is described in greater detail in U.S. Patent Application Publication no. 2003/0182317, entitled, “File Folding Technique,” filed on Mar. 22, 2002 by A. Kahn et al., and assigned to the assignee of the present application.
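The block-sharing behavior described above can be illustrated with a brief sketch. The classes and function names below (`BlockStore`, `File`, `snapshot`, `write_block`) are hypothetical and not taken from any actual Filer implementation; the sketch only shows the copy-on-write principle: after a Snapshot, modifying one block of a file allocates a new block for that position only, while every unmodified block remains a shared pointer between the Snapshot and the active file system.

```python
BLOCK_SIZE = 4096  # four-kilobyte blocks, as noted in the text


class BlockStore:
    """Hypothetical store in which each data block is kept once and
    referenced by an identifier (a stand-in for a disk block pointer)."""
    def __init__(self):
        self.blocks = {}   # block id -> block data
        self.next_id = 0

    def put(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        return bid


class File:
    """A file is modeled as an ordered list of pointers to blocks."""
    def __init__(self, store, block_ids):
        self.store = store
        self.block_ids = list(block_ids)


def snapshot(f):
    # A persistent, read-only image: a frozen copy of the pointer list.
    # No block data is copied.
    return tuple(f.block_ids)


def write_block(f, index, data):
    # Copy-on-write: only the modified position receives a new block;
    # all other pointers still reference the shared, snapshotted blocks.
    f.block_ids[index] = f.store.put(data)


store = BlockStore()
f = File(store, [store.put(b"A" * BLOCK_SIZE), store.put(b"B" * BLOCK_SIZE)])
snap = snapshot(f)
write_block(f, 1, b"C" * BLOCK_SIZE)

# The unmodified block is still shared between snapshot and active file;
# only the modified block received a new pointer.
assert f.block_ids[0] == snap[0]
assert f.block_ids[1] != snap[1]
```

Note that only three blocks exist in the store after the modification, rather than the four that a full copy of the file would require; this is the space saving the technique provides.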
File folding does help to more efficiently use storage space. However, it is desirable to reduce data duplication in an active file system without having to rely upon a persistent point-in-time image (e.g., a Snapshot). It is also desirable to reduce data duplication regardless of the location of the data in the file system.
Another prior art approach to avoiding duplication of data in a storage system involves computing a hash value for every file that is stored. For example, in one known prior art system, which does not use a traditional (hierarchical) file system approach, a storage server is used to store data on behalf of an application server or other client. When the application server wants the storage server to store a particular file, the application server computes a hash value for the file and sends the storage server a write request containing the file and the hash value.
The storage server uses hash values of files to help reduce data duplication. More specifically, the storage server maintains a database containing a mapping of all of the stored files to their respective hash values. When the storage server receives a write request with a hash value, it searches for a match of that hash value in its database. If no match is found, the storage server concludes that it does not have a copy of that file already stored, in which case the storage server requests the file from the application server. If a match of the hash value is found, however, the storage server concludes that it already has a copy of that file stored and, therefore, does not have to request the file from the application server.
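The hash-lookup exchange described in the preceding two paragraphs can be sketched as follows. The class and method names (`StorageServer`, `needs_file`, `store_file`) are hypothetical, as is the choice of SHA-256 as the hash function; the prior art system described above is proprietary and its actual protocol is not specified here. The sketch only shows the decision logic: the server consults its hash-to-file mapping and requests the file body from the client only when the hash is not already present.

```python
import hashlib


class StorageServer:
    """Hypothetical server that deduplicates whole files by hash value."""
    def __init__(self):
        self.by_hash = {}  # hash value -> stored file contents

    def needs_file(self, file_hash):
        # On receiving a write request with a hash value, search the
        # database for a match. No match means the file is not yet
        # stored and must be requested from the application server.
        return file_hash not in self.by_hash

    def store_file(self, file_hash, contents):
        self.by_hash[file_hash] = contents


def client_hash(contents):
    # The application server computes a hash over the entire file
    # before sending the write request.
    return hashlib.sha256(contents).hexdigest()


server = StorageServer()
data = b"example file contents"
h = client_hash(data)

assert server.needs_file(h)        # first write: server requests the file
server.store_file(h, data)
assert not server.needs_file(h)    # duplicate write: no transfer needed
```

The sketch also hints at the drawbacks noted below: every write requires a whole-file hash, and any modification to a file produces a different hash, so the modified file must be stored as an entirely new entry.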
This method of using hash values employs a proprietary set of protocols and semantics, which are very different from those used in a traditional (hierarchical) file system. Further, the need to compute a hash value for every read or write and for every data block adversely affects performance, particularly during reads. In addition, every time a file is modified, the file has to be stored as a new file with a new hash value associated with it. Moreover, this approach involves complicated cleanup issues with regard to determining when particular blocks can be freed.