It is common to find duplicate blocks of data in a large file system of a storage system. Duplication of data blocks may occur when, for example, two or more files (or other data containers) of the file system share common data. Duplication of data blocks typically results in inefficient use of storage space consumed by the storage system because identical copies of the data are stored in multiple, different locations served by the storage system.
There are well-known systems for de-duplicating duplicate data in such a file system. These systems typically employ data deduplication operations which are performed on fixed-size blocks, e.g., 4 kilobytes (KB) in size. When a new block is to be stored on the storage system, a hash value is typically utilized as an identifier or “fingerprint” of the 4 KB block, wherein the hash value may be computed on the block in accordance with a well-known mathematical function such as, e.g., a checksum function. The fingerprint may then be compared with a database containing fingerprints of previously stored blocks (i.e., a fingerprint database). Should the new block's fingerprint be identical to that of a previously stored block, there is a high degree of probability that the new block is an identical copy of the previously stored block. In such a case, the new block may be replaced with a pointer to the previously stored block, thereby reducing storage space consumption.
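By way of illustration only, the fingerprint comparison described above may be sketched as follows. The sketch assumes an in-memory store, uses SHA-256 in place of a checksum function, and adds a byte-for-byte comparison before sharing a block; the names DedupStore, write_block and write_file are hypothetical and do not correspond to any particular storage system.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB fixed-size blocks, as in the text


class DedupStore:
    """Toy in-memory block store with a fingerprint database (hypothetical)."""

    def __init__(self):
        self.blocks = []        # physical blocks actually stored
        self.fingerprints = {}  # fingerprint -> index into self.blocks

    def write_block(self, data: bytes) -> int:
        """Store one block and return the index ("pointer") of its physical copy."""
        fp = hashlib.sha256(data).digest()  # assumption: SHA-256 as the fingerprint
        idx = self.fingerprints.get(fp)
        # A matching fingerprint only makes identity highly probable, so this
        # sketch confirms with a byte comparison before sharing the block.
        if idx is not None and self.blocks[idx] == data:
            return idx
        self.blocks.append(data)
        self.fingerprints[fp] = len(self.blocks) - 1
        return len(self.blocks) - 1


def write_file(store: DedupStore, payload: bytes) -> list:
    """Split payload into fixed-size blocks and return the list of block pointers."""
    return [
        store.write_block(payload[off:off + BLOCK_SIZE])
        for off in range(0, len(payload), BLOCK_SIZE)
    ]
```

In this sketch, two files that share a 4 KB block receive the same pointer for that block, so only one physical copy is retained.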
A noted disadvantage of these well-known de-duplication systems is that the fingerprint database may accumulate stale fingerprints. A stale fingerprint, as used herein, is a fingerprint that does not identify the current state of a corresponding block in the file system. Stale fingerprints may be generated by deletion of files, truncation of files, or certain file system operations including, e.g., hole punching. Hole punching is a technique utilized to reclaim storage space in response to data deletion in certain environments, e.g., in an environment wherein a data container having a first data layout format is overlaid onto a storage space having a second data layout format. As will be appreciated by one skilled in the art, an operation that deletes a block from the file system, but does not write or overwrite the block, may result in a stale fingerprint. As the fingerprint database is typically stored in memory or secondary storage of the storage system, storage of stale fingerprints may cause consumption of additional storage system resources (such as memory and/or storage space). Further, as the size of the fingerprint database increases, the time required to perform certain operations, such as search operations during de-duplication, increases, thereby reducing storage system efficiency.
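By way of illustration only, the manner in which a deletion may leave a stale fingerprint behind can be sketched as follows. The sketch assumes a toy file system in which a write refreshes the fingerprint record for a block while a deletion does not touch the fingerprint database; the names FileSystem, fingerprint_db and stale_block_numbers are hypothetical.

```python
import hashlib


class FileSystem:
    """Toy model: writes update the fingerprint record, deletes do not."""

    def __init__(self):
        self.blocks = {}          # block number -> data currently stored
        self.fingerprint_db = {}  # block number -> fingerprint last recorded

    def write(self, block_no: int, data: bytes) -> None:
        # Writing or overwriting a block refreshes its fingerprint record.
        self.blocks[block_no] = data
        self.fingerprint_db[block_no] = hashlib.sha256(data).digest()

    def delete(self, block_no: int) -> None:
        # The block is freed, but its fingerprint record is left in place
        # and no longer describes any live block.
        self.blocks.pop(block_no, None)

    def stale_block_numbers(self) -> list:
        """Block numbers whose fingerprint records refer to deleted blocks."""
        return sorted(b for b in self.fingerprint_db if b not in self.blocks)
```

In this sketch, the fingerprint database keeps one record per block ever written, so it continues to grow in proportion to deleted data until the stale records are removed.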
One technique to eliminate stale fingerprints is to log the blocks which have been deleted in a stale fingerprint data structure and then utilize a known data structure, such as a binary search tree (BST), to identify the most recent fingerprints associated with each deleted block. A noted disadvantage of such a technique is that the BST approach operates with a complexity of O(n²). As will be appreciated by one skilled in the art, this technique quickly becomes cumbersome as the number of deleted blocks increases in a large file system.
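By way of illustration only, the general pattern of logging deleted blocks and later removing the corresponding fingerprints may be sketched as follows. This is not the BST-based technique referred to above, whose details are not given here; it is a deliberately naive nested scan, under the assumption that fingerprint records are (fingerprint, block number) pairs, intended only to show how the work grows when every record is checked against every log entry. The sketch also ignores the ordering of writes and deletions. The name purge_stale is hypothetical.

```python
def purge_stale(fingerprint_db, deleted_log):
    """Drop records for logged deletions using a naive nested scan.

    fingerprint_db: list of (fingerprint, block_no) records
    deleted_log:    list of block numbers logged as deleted
    Returns (surviving records, number of comparisons performed).
    """
    comparisons = 0
    kept = []
    for record in fingerprint_db:
        stale = False
        for block_no in deleted_log:
            comparisons += 1
            if record[1] == block_no:
                stale = True
                break  # this record is stale; no need to scan further
        if not stale:
            kept.append(record)
    return kept, comparisons
```

In this sketch, a record that matches no log entry is compared against the entire log, so a database of n records checked against a log of n unrelated deletions costs n × n comparisons.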