In a data storage system, it is desirable to use storage space as efficiently as possible. One type of system in which this concern can be particularly important is a storage server, such as a file server. File servers and other types of storage servers are often used to maintain extremely large quantities of data. In such systems, efficient utilization of storage space is critical.
Data containers (e.g., files) maintained by a file system generally are made up of individual blocks of data. A common block size is four kilobytes. In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files have some data in common or when a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space.
A de-duplication process eliminates redundant data within a file system. A de-duplication process can occur in-line or offline. When a de-duplication process occurs while data is being written to a file system, the process can be referred to as ‘in-line de-duplication.’ When a de-duplication process occurs after data is written to a storage device (e.g., disk), the process can be referred to as ‘offline de-duplication.’ A de-duplication process can further be described, for example, as including two operations: a ‘de-duplication operation’ (identifying and eliminating duplicate data blocks) and a ‘verify operation’ (identifying and removing stale entries from a fingerprints datastore). The de-duplication process keeps a fingerprint value for every block within a file system in a fingerprints datastore (FPDS). This fingerprints datastore is used to find redundant blocks of data within the file system during a de-duplication operation. Typically, the fingerprints datastore is sorted by fingerprint so that potential duplicates can be found efficiently. However, maintaining one entry for each block in a file system increases the size of the fingerprints datastore drastically. A larger fingerprints datastore makes both the de-duplication operation and the verify operation take more time.
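The role of the sorted fingerprints datastore can be illustrated with a minimal sketch. It assumes a SHA-256 digest as the fingerprint and an in-memory list as the datastore; the names FingerprintEntry, fingerprint_file, and potential_duplicates are hypothetical and chosen only for illustration. Sorting by fingerprint places entries for identical blocks next to each other, so potential duplicates can be found in a single pass.

```python
import hashlib
from itertools import groupby
from typing import NamedTuple

BLOCK_SIZE = 4096  # four-kilobyte blocks, the common size mentioned in the text


class FingerprintEntry(NamedTuple):
    fingerprint: str  # assumption: SHA-256 hex digest of the block contents
    inode: int        # file identifier
    fbn: int          # file block number (block offset within the file)


def fingerprint_file(inode: int, data: bytes) -> list[FingerprintEntry]:
    """Produce one fingerprint entry per block of a file."""
    entries = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        entries.append(FingerprintEntry(digest, inode, offset // BLOCK_SIZE))
    return entries


def potential_duplicates(fpds: list[FingerprintEntry]) -> list[list[FingerprintEntry]]:
    """Sort by fingerprint so equal fingerprints are adjacent, then group them."""
    ordered = sorted(fpds)  # tuple order: fingerprint, then inode, then fbn
    groups = []
    for _, run in groupby(ordered, key=lambda e: e.fingerprint):
        members = list(run)
        if len(members) > 1:  # more than one block shares this fingerprint
            groups.append(members)
    return groups


# Two small files that have one block ("A" * 4096) in common.
file_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
file_b = b"C" * BLOCK_SIZE + b"A" * BLOCK_SIZE
fpds = fingerprint_file(1, file_a) + fingerprint_file(2, file_b)
duplicate_groups = potential_duplicates(fpds)
```

In this sketch the datastore holds four entries, one per block, and the single duplicate group contains block 0 of inode 1 and block 1 of inode 2. A matching fingerprint only marks a potential duplicate; a real system would still need to confirm that the blocks are identical before sharing them.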
De-duplication can leave the fingerprints datastore with some fingerprint entries that are stale. A stale fingerprint entry is an entry that has a fingerprint that corresponds to a data block that has been deleted (freed) or overwritten, for example, during a de-duplication operation. The stale entries do not contribute to any space savings and add significant overhead in subsequent operations on the fingerprints datastore. These stale entries can be removed, for example, using a verify operation. Current implementations of a verify operation include two stages. In stage one, the fingerprints datastore is sorted in order by <file identifier, block offset in a file, time stamp> so that each entry can be checked for staleness. The fingerprints datastore is then overwritten with only the stale-free entries. In stage two, the output from stage one is sorted back to its original order (e.g., fingerprint, inode, file block number (fbn)). This conventional approach has several problems. The fingerprints datastore is sorted twice with each verify operation, even though the second sort is not needed to remove the stale entries. Moreover, the conventional approach overwrites the entire FPDS with stale-free entries, even if the stale entries are only a small percentage of the FPDS. In addition, a verify operation is typically a blocking operation. Thus, while a verify operation is executing on the FPDS, no other de-duplication (sharing) operation can execute, because de-duplication operations and verify operations should work from a consistent copy of the FPDS.
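The two-stage verify operation described above can be sketched as follows. This is only an illustration under stated assumptions: the datastore is an in-memory list, the set live_blocks (hypothetical) stands in for whatever file system lookup tells the verify operation that a block still exists, and an entry is treated as overwritten when a newer entry exists for the same <inode, fbn>.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    fingerprint: str
    inode: int      # file identifier
    fbn: int        # block offset in the file
    timestamp: int  # when the fingerprint was recorded


def verify(fpds: list[Entry], live_blocks: set[tuple[int, int]]) -> list[Entry]:
    """Remove stale entries using the conventional two-sort approach."""
    # Stage one: sort by <file identifier, block offset, time stamp> so that
    # all entries for one block are adjacent, with the newest entry last.
    by_location = sorted(fpds, key=lambda e: (e.inode, e.fbn, e.timestamp))
    stale_free = []
    for i, entry in enumerate(by_location):
        nxt = by_location[i + 1] if i + 1 < len(by_location) else None
        # Overwritten: a newer entry for the same block follows this one.
        overwritten = nxt is not None and (nxt.inode, nxt.fbn) == (entry.inode, entry.fbn)
        # Freed: the block no longer exists in the file system.
        freed = (entry.inode, entry.fbn) not in live_blocks
        if not overwritten and not freed:
            stale_free.append(entry)
    # Stage two: sort the survivors back into fingerprint order.
    return sorted(stale_free, key=lambda e: (e.fingerprint, e.inode, e.fbn))


fpds = [
    Entry("aa", 1, 0, 10),  # stale: block (1, 0) was overwritten at time 20
    Entry("bb", 1, 0, 20),
    Entry("aa", 2, 0, 15),
    Entry("cc", 3, 0, 5),   # stale: block (3, 0) has been freed
]
live_blocks = {(1, 0), (2, 0)}
verified = verify(fpds, live_blocks)
```

Both calls to sorted touch every entry, and the result replaces the whole datastore, even though only two of the four entries in this example are stale. That is the cost the paragraph above describes.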
De-duplication includes logging the fingerprint of any new data block that is written or updated in the file system into a changelog file. The changelog file is merged with the fingerprints datastore to find and eliminate duplicate data blocks. During this process, the fingerprints datastore is overwritten with the merged data on every de-duplication operation. Overwriting the entire fingerprints datastore on every de-duplication operation, however, can incur a large write cost.
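A minimal sketch of the changelog merge, assuming that both the fingerprints datastore and the changelog are in-memory lists of (fingerprint, inode, fbn) tuples and that the datastore is already sorted by fingerprint. The function name merge_changelog is hypothetical. The point of the example is that the merged output contains every existing entry again, so the write cost tracks the size of the whole datastore rather than the size of the changelog.

```python
import heapq

FpEntry = tuple[str, int, int]  # (fingerprint, inode, fbn)


def merge_changelog(fpds: list[FpEntry], changelog: list[FpEntry]):
    """Merge a changelog into a sorted datastore and report adjacent duplicates."""
    # The changelog is in write order, so it is sorted before the merge.
    merged = list(heapq.merge(fpds, sorted(changelog)))
    # After the merge, entries with equal fingerprints sit next to each other.
    duplicates = [
        (prev, cur)
        for prev, cur in zip(merged, merged[1:])
        if prev[0] == cur[0]
    ]
    return merged, duplicates


fpds = [("aa", 1, 0), ("bb", 1, 1), ("cc", 2, 0)]  # existing, sorted datastore
changelog = [("bb", 3, 0)]                         # one newly written block
new_fpds, duplicates = merge_changelog(fpds, changelog)
entries_written = len(new_fpds)  # every entry is rewritten, not just the new one
```

Here a single changelog entry causes four entries to be written back, three of which are unchanged.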
In addition, current de-duplication operations use logical information to identify blocks in a volume and their associated fingerprints. De-duplication maintains a fingerprint entry in the fingerprints datastore for each <inode, fbn>. This means that if a block is shared ‘n’ times, the fingerprints datastore will have ‘n’ entries for a single fingerprint value. Where there is a significant amount of logical data, however, the fingerprints datastore cannot scale proportionately.
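The growth described above can be shown with a short sketch, again assuming an in-memory list of (fingerprint, inode, fbn) tuples as a stand-in for the fingerprints datastore. The helper entries_per_fingerprint is hypothetical. When one block is shared by n files, the datastore keyed on <inode, fbn> holds n entries that all carry the same fingerprint.

```python
from collections import Counter


def entries_per_fingerprint(fpds: list[tuple[str, int, int]]) -> Counter:
    """Count how many <inode, fbn> entries exist for each fingerprint value."""
    return Counter(fingerprint for fingerprint, _inode, _fbn in fpds)


n = 5  # number of files sharing the same block
# One entry per <inode, fbn>, all with fingerprint "aa", plus one unshared block.
fpds = [("aa", inode, 0) for inode in range(1, n + 1)] + [("bb", 9, 0)]
counts = entries_per_fingerprint(fpds)
```

In this example the datastore holds six entries for only two distinct fingerprints, so its size follows the amount of logical data rather than the number of unique blocks.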