In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files or other data containers share common data, or where a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space by storing the identical data in a plurality of differing locations served by a storage system.
One technique that has been used to address this problem is to allow new data of a storage volume to share a data block with old data of the storage volume if the new data is identical to the old data. This technique may be performed on a storage volume by a process referred to as deduplication. In one example, a deduplication process may generate a hash, referred to as a fingerprint, of each data block stored on a storage volume. The fingerprints of the data blocks are stored in a data structure, such as a hash table or a flat file, which is referred to as a fingerprint database. When a new data block is to be stored, a fingerprint of the new data block is generated. The fingerprint is then compared against the fingerprints of previously stored blocks in the fingerprint database. If the new data block's fingerprint is identical to that of a previously stored block, there is a high degree of probability that the new block is identical to the previously stored block. In this case, a byte-by-byte comparison may be performed on the previously stored data block and the new data block to determine whether the data blocks are identical. If the data blocks are identical, the new block is replaced with a pointer to the previously stored block, thereby reducing storage resource consumption.
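The deduplication process described above may be sketched as follows. This is a minimal illustrative example, assuming an in-memory dictionary as the fingerprint database, a list as the physical block store, and integer block indices standing in for block pointers; the class and method names are hypothetical and do not correspond to any particular storage system.

```python
import hashlib

class Deduplicator:
    """Illustrative block-level deduplication using a fingerprint database."""

    def __init__(self):
        self.blocks = []        # physical block store
        self.fingerprints = {}  # fingerprint database: fingerprint -> block index

    def fingerprint(self, data: bytes) -> str:
        # A hash of the data block serves as its fingerprint.
        return hashlib.sha256(data).hexdigest()

    def write_block(self, data: bytes) -> int:
        fp = self.fingerprint(data)
        if fp in self.fingerprints:
            idx = self.fingerprints[fp]
            # Fingerprint match: high probability of a duplicate, so
            # verify byte-by-byte before sharing the stored block.
            if self.blocks[idx] == data:
                # Identical: return a "pointer" to the existing block
                # instead of storing the data again.
                return idx
        # No duplicate found: store the new block and record its fingerprint.
        self.blocks.append(data)
        self.fingerprints[fp] = len(self.blocks) - 1
        return len(self.blocks) - 1
```

Writing the same block twice returns the same index, and only one physical copy is retained.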
However, existing data structures may not be optimal for storing the fingerprint database, particularly for storage systems in which the fingerprints have a high locality of reference, such as backup storage systems. A disadvantage of existing data structures is that the order of the fingerprints stored in the fingerprint database may not correspond to the order of the corresponding data blocks on the file system, e.g., two sequential fingerprints in the fingerprint database may not correspond to two sequential data blocks on the storage system. Thus, fingerprints cannot effectively be pre-fetched and cached from the fingerprint database, but instead may be read individually. Since backup storage volumes may frequently be very similar, it may be desirable to pre-fetch and cache fingerprints of sequential data blocks during a deduplication operation of a backup storage volume.
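The benefit of ordering the fingerprint database to match block order may be illustrated with a simple sketch. This is a hypothetical example, assuming the fingerprint database is a list ordered by block position; on a cache miss, a run of sequential fingerprints is fetched at once, so lookups for neighboring blocks hit the cache rather than the database. The class name and parameters are illustrative only.

```python
class PrefetchingFingerprintCache:
    """Illustrative cache that pre-fetches runs of sequential fingerprints."""

    def __init__(self, database, prefetch_size=64):
        # database: fingerprints stored in the same order as their data blocks
        self.database = database
        self.prefetch_size = prefetch_size
        self.cache = {}   # block index -> fingerprint
        self.reads = 0    # count of fingerprint database accesses

    def get(self, index):
        if index not in self.cache:
            # Cache miss: because fingerprints are stored in block order,
            # fetch this fingerprint and its sequential neighbors together.
            self.reads += 1
            end = min(index + self.prefetch_size, len(self.database))
            for i in range(index, end):
                self.cache[i] = self.database[i]
        return self.cache[index]
```

In this sketch, scanning 128 sequential fingerprints with a prefetch size of 64 costs two database accesses rather than 128; with an unordered database, each lookup would require its own access.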
Furthermore, storage systems may include increased data storage per disk head and additional central processing unit (CPU) cores and memory to service the disk heads. However, the amount of time required for the disk head to perform a random disk access has not improved. Thus, it may be desirable for storage systems to minimize random disk accesses, even at the expense of processing and/or memory resources. Therefore, there may be a need for a data structure for storing fingerprints which is capable of minimizing random disk accesses during a deduplication operation.