Data deduplication (sometimes referred to as data optimization) is a recent trend in storage systems and generally refers to reducing the number of bytes that need to be stored on disk or transmitted across a network without compromising the fidelity or integrity of the original data; that is, the reduction in bytes is lossless and the original data can be completely recovered. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware costs (for storage and network transmission) and data-management costs (e.g., backup). As the amount of digitally stored data grows, these cost savings become significant.
Data deduplication typically uses a combination of techniques for eliminating redundancy within and between persistently stored files. One technique identifies identical regions of data in one or multiple files and physically stores only one copy of each unique region (chunk), while maintaining a pointer to that chunk in association with each file that contains it. Another technique is to mix data deduplication with compression, e.g., by storing each unique chunk in compressed form.
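The two techniques above can be illustrated with a minimal sketch. It assumes fixed-size chunks, SHA-256 as the chunk hash, and zlib as the compressor; none of these choices comes from the text, and the names `store_file`, `read_file`, and `CHUNK_SIZE` are hypothetical.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # bytes; assumed and deliberately tiny so the example is easy to follow


def store_file(data, chunk_store):
    """Store each unique chunk once (compressed) and return the file's chunk pointers."""
    pointers = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in chunk_store:
            # First time this region is seen: keep one compressed copy.
            chunk_store[key] = zlib.compress(chunk)
        # The file itself is represented only by pointers to chunks.
        pointers.append(key)
    return pointers


def read_file(pointers, chunk_store):
    """Rebuild the original bytes losslessly from the chunk pointers."""
    return b"".join(zlib.decompress(chunk_store[key]) for key in pointers)


chunk_store = {}
file_a = b"AAAABBBBAAAACCCC"  # contains the region AAAA twice
file_b = b"BBBBCCCCDDDD"      # shares BBBB and CCCC with file_a
pointers_a = store_file(file_a, chunk_store)
pointers_b = store_file(file_b, chunk_store)
```

Here seven chunk references across the two files resolve to only four stored chunks, and `read_file` recovers each file exactly. Real systems generally use much larger chunks, where compression also pays off; at four bytes per chunk zlib adds overhead rather than saving space.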
To identify the chunks, the server that stores them maintains a hash index service for the hashes of all chunks in the system. The hash uniquely identifies the chunk and serves as the key of a (key, value) pair. The value corresponds to the location of the chunk in a chunk store.
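As a rough illustration of such an index, the sketch below keeps the chunk store as an append-only byte buffer and maps each chunk hash to an (offset, length) location. The class name `ChunkStore`, the in-memory dictionary, and the choice of SHA-256 are assumptions made for the example, not details from the text.

```python
import hashlib


class ChunkStore:
    """Hypothetical chunk store with a hash index: hash -> (offset, length)."""

    def __init__(self):
        self.log = bytearray()  # append-only container holding chunk bytes
        self.index = {}         # key: chunk hash, value: location in self.log

    def put(self, chunk):
        """Store the chunk if it is new and return its hash (the index key)."""
        key = hashlib.sha256(chunk).digest()
        if key not in self.index:
            # New chunk: append it and record where it lives.
            self.index[key] = (len(self.log), len(chunk))
            self.log += chunk
        return key

    def get(self, key):
        """Look up the chunk location by hash and read the bytes back."""
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])


store = ChunkStore()
k1 = store.put(b"hello")
k2 = store.put(b"world")
k3 = store.put(b"hello")  # duplicate: the index lookup hits, nothing is appended
```

After these calls `k1` and `k3` are the same key, the buffer holds ten bytes rather than fifteen, and `store.index[k2]` is `(5, 5)`.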
Because contemporary deduplication systems may need to scale to tens of terabytes to petabytes of data volume, the chunk hash index may be too large to fit into a primary storage device (i.e., RAM). A secondary storage device, such as a hard disk drive or a solid-state drive, therefore needs to be used. Index operations are then throughput-limited by the relatively slow I/O operations executed on the secondary storage device. What is needed is a way to reduce the I/O access times as much as possible given limited primary storage resources.
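A back-of-the-envelope estimate shows why the index can outgrow RAM. All figures below are assumptions chosen for illustration rather than values from the text: 1 PiB of unique data, a 64 KiB average chunk size, and 40 bytes per index entry (a 32-byte hash plus an 8-byte location).

```python
def index_size_bytes(unique_data_bytes, avg_chunk_bytes, entry_bytes):
    """Estimate index size as (number of unique chunks) x (bytes per entry)."""
    num_chunks = unique_data_bytes // avg_chunk_bytes
    return num_chunks * entry_bytes


PIB = 2 ** 50
GIB = 2 ** 30

# Assumed workload: 1 PiB unique data, 64 KiB chunks, 32-byte hash + 8-byte location.
size = index_size_bytes(PIB, 64 * 1024, 32 + 8)
size_gib = size / GIB
print(f"{size_gib:.0f} GiB")  # prints "640 GiB"
```

Under these assumptions the index alone needs about 640 GiB, before any per-entry overhead of the data structure that holds it.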