Data deduplication (sometimes referred to as data optimization) refers to detecting, uniquely identifying and eliminating redundant data in storage systems, thereby reducing the number of bytes that need to be stored on disk or transmitted across a network, without compromising the fidelity or integrity of the original data. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware and power costs (for storage) and data management costs (e.g., reducing backup costs). As the amount of digitally stored data grows, these cost savings become significant.
Data deduplication typically uses a combination of techniques for eliminating redundancy within and between persistently stored files. One such technique identifies identical regions of data in one or multiple files and physically stores only one copy of each unique region (chunk), while maintaining, in association with each file, a reference to that chunk for every repeated occurrence of the data. Another technique is to combine data deduplication with compression, e.g., by storing compressed chunks.
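By way of illustration only, the chunk-and-reference technique described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular implementation: the fixed chunk size, the use of SHA-256 as the chunk identifier, the use of zlib for chunk compression, and the names ChunkStore, deduplicate and rehydrate are all hypothetical choices made for the example.

```python
import hashlib
import zlib
from typing import Dict, List

# Assumption: fixed-size chunks. Practical systems often use variable-size,
# content-defined chunk boundaries instead.
CHUNK_SIZE = 4096


class ChunkStore:
    """Hypothetical in-memory chunk store keyed by chunk hash."""

    def __init__(self) -> None:
        self._chunks: Dict[str, bytes] = {}  # hash -> compressed chunk

    def put(self, chunk: bytes) -> str:
        """Store the chunk once (compressed) and return its reference."""
        key = hashlib.sha256(chunk).hexdigest()
        if key not in self._chunks:
            self._chunks[key] = zlib.compress(chunk)
        return key

    def get(self, key: str) -> bytes:
        """Return the original, uncompressed chunk for a reference."""
        return zlib.decompress(self._chunks[key])

    def __len__(self) -> int:
        return len(self._chunks)


def deduplicate(data: bytes, store: ChunkStore) -> List[str]:
    """Split data into chunks and return the file's list of chunk references."""
    return [
        store.put(data[offset:offset + CHUNK_SIZE])
        for offset in range(0, len(data), CHUNK_SIZE)
    ]


def rehydrate(refs: List[str], store: ChunkStore) -> bytes:
    """Rebuild the original file contents from its chunk references."""
    return b"".join(store.get(ref) for ref in refs)
```

In this sketch, each file is represented only by its ordered list of references, so a region that repeats within one file or across several files occupies the chunk store once, and rehydrating the references reproduces the original bytes exactly.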
There are many difficulties, tradeoffs and choices with data deduplication, including that in some environments, there is too much data to deduplicate in a single operation given the available time and resources. A specific challenge exists with files that are partially deduplicated, such as a previously deduplicated file that is partially recalled to disk in response to new data or a modification of existing data. Deduplicating the partially deduplicated file in its entirety requires a significant amount of input/output resources for reading the entire file and writing the resulting chunks into the chunk store, as well as CPU resources for performing the chunking and compression.