Data deduplication (sometimes referred to as data optimization) refers to eliminating redundant data in storage systems, thereby reducing the number of bytes that need to be physically stored on disk or transmitted across a network, without compromising the fidelity or integrity of the original data. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware costs (for storage) and data-management costs (e.g., backup). As the amount of digitally stored data grows, these cost savings become significant.
Data deduplication typically uses a combination of techniques for eliminating redundancy within and between persistently stored files. One technique operates to identify identical regions of data in one or multiple files, and physically store only one unique region (chunk), while maintaining a reference to that chunk in association with the file. Another technique is to mix data deduplication with compression, e.g., by storing compressed chunks.
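By way of illustration only, the two techniques described above (storing one copy of each unique chunk, and compressing stored chunks) can be sketched as follows. The sketch assumes fixed-size chunking, SHA-256 content hashes as chunk identifiers, and zlib compression; the names ChunkStore, deduplicate and CHUNK_SIZE are hypothetical and are not taken from any particular deduplication system, which may chunk, identify and compress data differently.

```python
import hashlib
import zlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


class ChunkStore:
    """Illustrative in-memory chunk store keyed by content hash."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> compressed chunk bytes

    def put(self, data: bytes) -> str:
        """Store a chunk only if it is not already present; return its hash."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self.chunks:
            self.chunks[key] = zlib.compress(data)
        return key

    def get(self, key: str) -> bytes:
        """Return the original (decompressed) bytes of a stored chunk."""
        return zlib.decompress(self.chunks[key])


def deduplicate(data: bytes, store: ChunkStore, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and return the list of chunk references."""
    return [
        store.put(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


store = ChunkStore()
refs_a = deduplicate(b"AAAABBBBAAAA", store)  # two identical regions in one file
refs_b = deduplicate(b"BBBBCCCC", store)      # one region shared with the first file
```

In this sketch the two files together contain five chunks, but only three unique chunks are physically stored; each file keeps an ordered list of references to those chunks.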
The data of deduplicated files are thus stored in chunks or compressed chunks in a chunk store, where the files themselves are left as “stubs” comprising references to the chunks. When a user or an application needs to access a deduplicated file, a deduplication engine brings the data back into memory (referred to as rehydration) or to disk (referred to as recall). When a user or an application modifies that data, parts of the old optimized data may be needed to ensure data consistency and integrity.
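As a minimal sketch of the stub and chunk-store arrangement just described, the following assumes that a stub is simply an ordered list of chunk hashes and that the chunk store is a dictionary of zlib-compressed chunks. The function names make_stub, rehydrate and recall are hypothetical and chosen only to mirror the terms used above; an actual deduplication engine would differ in its stub format and storage layout.

```python
import hashlib
import zlib


def make_stub(data: bytes, chunk_store: dict, chunk_size: int = 4) -> list:
    """Move a file's data into the chunk store and return a stub (chunk references)."""
    stub = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        # Store each unique chunk once, in compressed form.
        chunk_store.setdefault(key, zlib.compress(chunk))
        stub.append(key)
    return stub


def rehydrate(stub: list, chunk_store: dict) -> bytes:
    """Bring the file's data back into memory by resolving each reference in order."""
    return b"".join(zlib.decompress(chunk_store[ref]) for ref in stub)


def recall(stub: list, chunk_store: dict, path: str) -> None:
    """Bring the file's data back to disk by writing each resolved chunk to path."""
    with open(path, "wb") as f:
        for ref in stub:
            f.write(zlib.decompress(chunk_store[ref]))
```

Note that both functions in this sketch visit and decompress every chunk referenced by the stub, regardless of how much of the file the caller actually needs.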
The process of rehydration or recall introduces latency in data access because of the possible need to decompress chunks, because of the file fragmentation that chunking introduces, and because of the chunk store's location and implementation. Full file recall introduces high latency and considerable I/O overhead. When the file is large, these latency and resource consumption problems worsen.
Further, when a large file has been fully recalled, the deduplication engine may need to deduplicate the file again. This consumes significant resources and reduces overall data deduplication throughput, which is already a challenge given the large amount of data a typical deduplication system needs to manage.