Deduplication is a data compression technique that can be used to reduce the storage capacity required to store data and the bandwidth required to transfer data. In the event a cluster (the smallest logical amount of disk space that can be allocated by a file system) of data appears more than once, all instances but one can be replaced by pointers to a single instance of the cluster. For example, each duplicate 4 KB cluster can be replaced with a 64-bit (8-byte) pointer, resulting in a 512-fold decrease in data size per replacement.
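The pointer replacement described above can be sketched in a few lines of Python. This is an illustrative sketch only, not a description of any particular file system: the names `deduplicate` and `restore` are hypothetical, the image is assumed to fit in memory, and for simplicity every cluster position (including the first instance) is represented by an index into a list of unique clusters rather than by a 64-bit on-disk pointer.

```python
CLUSTER_SIZE = 4096  # bytes (4 KB), the example cluster size used in the text
POINTER_SIZE = 8     # bytes (a 64-bit pointer)


def deduplicate(image: bytes, cluster_size: int = CLUSTER_SIZE):
    """Split an image into clusters and keep one copy of each distinct cluster.

    Returns (unique_clusters, pointers), where pointers[i] is the index in
    unique_clusters of the content that belongs at cluster position i.
    """
    unique_clusters = []  # one stored instance per distinct cluster
    index_of = {}         # cluster content -> index in unique_clusters
    pointers = []         # one entry per cluster position in the image
    for offset in range(0, len(image), cluster_size):
        cluster = image[offset:offset + cluster_size]
        if cluster not in index_of:
            index_of[cluster] = len(unique_clusters)
            unique_clusters.append(cluster)
        pointers.append(index_of[cluster])
    return unique_clusters, pointers


def restore(unique_clusters, pointers) -> bytes:
    """Rebuild the original image by following the pointers."""
    return b"".join(unique_clusters[i] for i in pointers)


# Per-replacement saving from the text: 4096 bytes / 8 bytes = 512
REDUCTION_PER_REPLACEMENT = CLUSTER_SIZE // POINTER_SIZE
```

For an image made of three clusters in which the first and third are identical, `deduplicate` stores two clusters and three pointers, and `restore` reproduces the original bytes exactly.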
Data is commonly stored on hard disks in clusters, the size of which is determined by the hypervisor or other operating system controlling the disk. For newer hard disks, 4096 bytes (i.e., 4 KB) is a standard cluster size. Deduplication can be used to effectively increase the amount of data that can be stored on a hard disk. Virtual-machine disk images are also cluster-based. Deduplication can be used to reduce the size of a virtual-machine image and thus the storage capacity and bandwidth required, respectively, to store and transfer the virtual-machine image.
Cluster-by-cluster comparisons for all clusters of a disk image can be resource intensive. Comparisons are performed by loading clusters into memory. Comparisons of clusters in memory can be performed relatively quickly, but loading the clusters from disk is time-consuming.
To reduce the number of disk accesses required, the clusters can be hashed and the resulting hashes compared. For example, a 256-bit (32-byte) hash can be generated for each 32-kbit (4 KB) cluster, providing a 128-fold reduction in the amount of data that must be held in memory per cluster to effect comparisons. However, it still may not be feasible to hold hashes for all the clusters in memory at once, so time-consuming disk accesses may still be required. What is needed is an approach to deduplication that further reduces the number of disk accesses required.
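The hash-based comparison described above might look like the following sketch. It is a sketch under stated assumptions rather than a prescribed method: SHA-256 is chosen here only because it yields a 256-bit digest (the text does not name a hash function), the name `find_duplicate_candidates` is hypothetical, and the whole digest table is assumed to fit in memory, which is exactly the assumption that may fail for large images.

```python
import hashlib
import io

CLUSTER_SIZE = 4096  # bytes (4 KB), the example cluster size used in the text


def find_duplicate_candidates(stream, cluster_size: int = CLUSTER_SIZE):
    """Read a disk image one cluster at a time, keeping only digests in memory.

    Returns (first_seen, candidates):
      first_seen maps each 32-byte digest to the index of the first cluster
        that produced it;
      candidates lists (duplicate_index, original_index) pairs whose digests
        match. A byte-for-byte check of each pair could follow if hash
        collisions must be ruled out.
    """
    first_seen = {}
    candidates = []
    index = 0
    while True:
        cluster = stream.read(cluster_size)
        if not cluster:
            break
        digest = hashlib.sha256(cluster).digest()  # 32 bytes per cluster
        if digest in first_seen:
            candidates.append((index, first_seen[digest]))
        else:
            first_seen[digest] = index
        index += 1
    return first_seen, candidates


# Per-cluster memory saving from the text: 4096 bytes / 32 bytes = 128
MEMORY_REDUCTION = CLUSTER_SIZE // hashlib.sha256().digest_size
```

Each cluster is read from the stream once and only its 32-byte digest is retained, so the comparison itself runs entirely on in-memory data. Any file-like object opened in binary mode can serve as `stream`; an `io.BytesIO` wrapper is convenient for experimentation.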