Data deduplication (sometimes referred to as data optimization) refers to reducing the number of bytes of data that need to be stored on disk or transmitted across a network, without compromising the fidelity or integrity of the original data, i.e., the reduction in bytes is lossless and the original data can be completely recovered. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware costs (for storage and network transmission) and data-management costs (e.g., backup). As the amount of digitally stored data grows, these cost savings become significant.
Data deduplication typically uses a combination of techniques for eliminating redundancy within and between persistently stored files. One technique identifies identical regions of data in one or multiple files, and physically stores only one unique region (chunk), while maintaining a pointer to that chunk in association with each file that contains it. Another technique is to combine data deduplication with compression, e.g., by storing compressed chunks.
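The two techniques above can be illustrated with a minimal sketch. It is not taken from any particular deduplication system: the `ChunkStore` class, the fixed-size chunking, the choice of SHA-256 as the chunk hash, and the use of zlib for compression are all assumptions made for illustration only.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy in-memory chunk store (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumption: fixed-size chunking
        self.chunks = {}  # chunk hash -> compressed chunk bytes
        self.files = {}   # file name -> ordered list of chunk hashes (pointers)

    def put(self, name, data):
        """Split data into chunks and store each unique chunk only once."""
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Deduplication combined with compression: the single
                # stored copy of the chunk is kept in compressed form.
                self.chunks[digest] = zlib.compress(chunk)
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name):
        """Losslessly rebuild a file from its chunk pointers."""
        return b"".join(
            zlib.decompress(self.chunks[d]) for d in self.files[name]
        )


store = ChunkStore(chunk_size=4)
store.put("a.txt", b"AAAABBBBAAAA")  # chunks: AAAA, BBBB, AAAA
store.put("b.txt", b"BBBBCCCC")      # chunks: BBBB, CCCC
assert store.get("a.txt") == b"AAAABBBBAAAA"
assert len(store.chunks) == 3        # only AAAA, BBBB, CCCC are stored
```

In this sketch the five chunks across the two files reduce to three stored chunks, and each file is represented only by its ordered list of chunk hashes.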
In order to identify the chunks, the server that stores the chunks maintains a hash index service for the hashes of the chunks in the system. The hashes do not have locality, i.e., chunk hashes for chunks in the same file are unrelated, and any edit to a given chunk's content creates a very different (unrelated) hash value. Thus traditional database technology, such as B-tree indexing, leads to poor performance in index serving. Maintaining the entire index in memory provides good performance, but consumes too many resources. The server memory resource is needed by other server applications (e.g., in primary data deduplication scenarios), and for caching.
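The lack of locality described above can be seen in a short sketch. SHA-256 is assumed here as the chunk hash, and `chunk_hash` and `differing_bits` are hypothetical helper names introduced only for this illustration.

```python
import hashlib


def chunk_hash(chunk: bytes) -> str:
    """Hex digest used as the index key for a chunk (SHA-256 assumed)."""
    return hashlib.sha256(chunk).hexdigest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions in which the hashes of a and b differ."""
    ha = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hb = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(ha ^ hb).count("1")


original = b"x" * 4096
edited = b"y" + original[1:]  # a one-byte edit to the chunk's content

print(chunk_hash(original))
print(chunk_hash(edited))
print(differing_bits(original, edited), "of 256 hash bits differ")
```

With a cryptographic hash, roughly half of the 256 output bits are expected to differ after even a one-byte edit, so the keys of related chunks land at unrelated positions in any ordered index. Lookups therefore behave like random accesses, which is why an ordered structure such as a B-tree gains little from its ordering here.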
Prior backup-oriented data deduplication optimization has relied upon a look-ahead cache to reduce the amount of resources used in accessing the index on the server. However, data deduplication is no longer limited to data backup scenarios, and is moving towards use in primary data storage clusters that are accessed like any other storage device. In such scenarios, the use of a look-ahead cache alone to reduce the resource usage is not an adequate solution.