In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files or other data containers share common data or where a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space because identical data is stored in a plurality of different locations served by a storage system.
A technique that has been used to address this problem, commonly referred to as “deduplication,” involves detecting duplicate data blocks by computing a hash value (fingerprint) of each new data block that is stored on disk, and then comparing the new fingerprint to fingerprints of previously stored blocks. When the fingerprint is identical to that of a previously stored block, the deduplication process determines that there is a high degree of probability that the new block is identical to the previously stored block. The deduplication process then compares the contents of the data blocks with identical fingerprints to verify that they are, in fact, identical. If the contents match, the block pointer to the recently stored duplicate data block is replaced with a pointer to the previously stored data block and the duplicate data block is deallocated, thereby reducing storage resource consumption.
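The fingerprint-then-verify flow described above can be illustrated with a minimal sketch. It is not taken from any particular storage system: the `BlockStore` class, its in-memory dictionaries, and the choice of SHA-256 as the fingerprint function are all assumptions made for illustration only.

```python
import hashlib


class BlockStore:
    """Toy in-memory block store (hypothetical) with post-write deduplication."""

    def __init__(self):
        self.blocks = {}          # physical block id -> block contents
        self.refcount = {}        # physical block id -> number of pointers to it
        self.by_fingerprint = {}  # fingerprint -> ids of blocks with that fingerprint
        self.next_id = 0

    def _allocate(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.refcount[block_id] = 1
        return block_id

    def write(self, data):
        """Store a block, deduplicate it, and return the block pointer to use."""
        new_id = self._allocate(data)  # the new block is stored first
        fingerprint = hashlib.sha256(data).digest()
        for candidate in self.by_fingerprint.get(fingerprint, []):
            # A matching fingerprint only makes identity highly probable,
            # so the contents are compared before the blocks are shared.
            if self.blocks[candidate] == data:
                del self.blocks[new_id]    # deallocate the duplicate
                del self.refcount[new_id]
                self.refcount[candidate] += 1
                return candidate           # pointer now refers to the older block
        self.by_fingerprint.setdefault(fingerprint, []).append(new_id)
        return new_id
```

In this sketch, writing the same contents twice returns the same block pointer and leaves only one physical copy, although the duplicate is still allocated briefly before being released.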
Deduplication processes typically assume that all data blocks have a similar probability of being shared. However, this assumption does not hold true in certain applications. For example, it often does not hold true in virtualization environments, where a single physical storage server is partitioned into multiple virtual machines. Typically, when a user creates an instance of a virtual machine, the user is given the option to specify the size of a virtual disk that is associated with the virtual machine. Upon creation, the virtual disk image file is initialized with all zeros. When the host system includes a deduplication process, such as the technique described above, the zero-filled blocks of the virtual disk image file may be “fingerprinted” and identified as duplicate blocks. The duplicate blocks are then deallocated and replaced with a block pointer to a single instance of the block on disk. As a result, the virtual disk image file consumes less space on the host disk.
However, there are disadvantages associated with a single instance of a block on disk being referenced in place of a large number of deallocated duplicate blocks. One disadvantage is that “hot spots” may occur on the host disk as a result of the file system frequently accessing the single instance of the data. This may occur with high frequency because the majority of the free space on the virtual disk references the single zero-filled block. To reduce hot spots, some deduplication processes include a provision for predefining a maximum number of shared block references (e.g., 255). When such a provision is implemented, the first 255 duplicate blocks reference a first instance of the shared block, the second 255 duplicate blocks reference a second instance, and so on.
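The effect of a maximum reference count can be sketched as follows. The function name `dedup_with_cap`, its return shape, and the interpretation that each physical instance is referenced by at most `max_refs` block pointers are assumptions made for illustration; fingerprinting is omitted and block contents are compared directly to keep the sketch short.

```python
MAX_REFS = 255  # example limit mentioned in the text


def dedup_with_cap(blocks, max_refs=MAX_REFS):
    """Share identical blocks, starting a new instance once one is full.

    Returns (pointers, instances), where pointers[i] is the index in
    instances used by logical block i, and each instance is a
    [data, reference_count] pair.
    """
    instances = []      # physical instances: [data, reference_count]
    open_instance = {}  # data -> index of the newest instance for that data
    pointers = []
    for data in blocks:
        idx = open_instance.get(data)
        if idx is None or instances[idx][1] >= max_refs:
            # No instance yet, or the current one has reached the cap:
            # keep a fresh physical copy and direct later duplicates to it.
            instances.append([data, 0])
            idx = len(instances) - 1
            open_instance[data] = idx
        instances[idx][1] += 1
        pointers.append(idx)
    return pointers, instances


# A hypothetical virtual disk image of 1000 zero-filled 4 KiB blocks.
zero_block = bytes(4096)
pointers, instances = dedup_with_cap([zero_block] * 1000)
print(len(instances))  # number of physical zero-filled instances kept
```

Under these assumptions, the 1000 zero-filled blocks collapse to four physical instances rather than one, which spreads accesses across several locations at the cost of a small amount of additional space.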
Another disadvantage of deduplication is disk fragmentation. Disk fragmentation may occur as a consequence of the duplicate blocks being first allocated and then later deallocated by the deduplication process. Moreover, the redundant allocation and deallocation of duplicate blocks results in unnecessary processing time and bookkeeping overhead.