Data deduplication is often used to identify and eliminate duplicate copies of repeating data. As a result, data deduplication can improve storage utilization, reduce the amount of data transferred over a network connection, etc. For example, a file system may periodically generate hashes of new files and determine whether any of the new file hashes match hashes of existing files. When a new file hash matches a hash for an older file, the new file data is removed and replaced with a pointer to the older file.
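The file-level example above can be illustrated with a minimal sketch. It is not taken from any particular file system: the class name DedupStore, its method names, the in-memory dictionaries, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory file store with a periodic deduplication pass."""

    def __init__(self):
        self.data = {}        # file name -> bytes, for files holding their own data
        self.pointers = {}    # file name -> name of the older file holding the data
        self.hash_index = {}  # content hash -> name of the first file with that content
        self.pending = []     # new files that have not been hashed yet

    def add_file(self, name, content):
        # New files are stored in full and queued for the next pass.
        self.data[name] = content
        self.pending.append(name)

    def dedup_pass(self):
        # Hash each new file; on a match, drop its data and keep a pointer.
        for name in self.pending:
            digest = hashlib.sha256(self.data[name]).hexdigest()  # SHA-256 is an assumption
            older = self.hash_index.get(digest)
            if older is None:
                self.hash_index[digest] = name
            else:
                del self.data[name]
                self.pointers[name] = older
        self.pending.clear()

    def read_file(self, name):
        # Follow the pointer, if any, to the file that holds the data.
        return self.data[self.pointers.get(name, name)]


store = DedupStore()
store.add_file("report_v1.txt", b"quarterly numbers")
store.add_file("report_copy.txt", b"quarterly numbers")
store.dedup_pass()
print(store.pointers)  # report_copy.txt now points at report_v1.txt
```

The sketch trusts a hash match on its own. A real implementation might also compare the file contents byte for byte before discarding data, to guard against hash collisions.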
A host computer attached to an external storage array may also utilize deduplication to reduce read requests transmitted to the storage array. For example, the host may generate and maintain a manifest of hashes for each block of data the storage array stores in one or more virtual machine disks for a virtual machine running on the host. If the host detects a read request from the virtual machine, the host retrieves from the manifest the hashes corresponding to the requested blocks. The retrieved hashes are compared to hashes mapped to data within the host's cache. If the retrieved hashes match hashes mapped to cached data, the host returns the cached data to the virtual machine in response to the read request (rather than reading the data from the storage array).
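The manifest-based read path described above might be sketched as follows. The Host class, the dictionary standing in for the storage array, the use of SHA-256, and the policy of filling the cache on a miss are assumptions for illustration only, not details taken from the description.

```python
import hashlib


def block_hash(data):
    # SHA-256 is an assumption; the description does not name a hash function.
    return hashlib.sha256(data).hexdigest()


class Host:
    """Hypothetical host that serves reads from a hash-addressed cache."""

    def __init__(self, storage_array):
        # storage_array: dict mapping block number -> bytes (stands in for the array).
        self.storage_array = storage_array
        # Building the manifest requires reading every block once.
        self.manifest = {n: block_hash(d) for n, d in storage_array.items()}
        self.cache = {}       # hash -> cached block data
        self.array_reads = 0  # reads sent to the array while serving requests

    def read(self, block_numbers):
        result = []
        for n in block_numbers:
            digest = self.manifest[n]      # hash for the requested block
            data = self.cache.get(digest)  # compare against hashes mapped to cached data
            if data is None:
                data = self.storage_array[n]  # miss: read from the storage array
                self.array_reads += 1
                self.cache[digest] = data     # assumed policy: cache on miss
            result.append(data)
        return result


# Blocks 0 and 2 hold identical data, so they share one cache entry.
host = Host({0: b"AAAA", 1: b"BBBB", 2: b"AAAA"})
host.read([0, 1, 2])
print(host.array_reads)  # 2
```

Because the cache is keyed by hash rather than by block number, the read of block 2 is served from the entry created for block 0, and only two reads reach the storage array.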
The generation and maintenance of such a manifest, however, places a large demand on the host's time and processing resources and requires a large amount of data transfer from the storage array to the host. For example, the host computer reads each virtual disk stored on the storage array to generate hashes for the manifest. Hashes need to be regenerated as the corresponding data changes over time. Generating and regenerating hashes leads to downtime during which the corresponding virtual disk(s) are inaccessible. The manifest also occupies storage space that could otherwise be used for other purposes.