Data deduplication (sometimes referred to as data optimization) refers to detecting, uniquely identifying, and eliminating redundant data in storage systems, thereby reducing the number of bytes that need to be stored on disk or transmitted across a network, without compromising the fidelity or integrity of the original data. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware and power costs (for storage) and in data management costs (e.g., backup costs). As the amount of digitally stored data grows, these cost savings become significant.
Data deduplication typically uses a combination of techniques for eliminating redundancy within and between persistently stored files. One such technique identifies identical regions of data in one or more files and physically stores only one copy of each unique region (chunk), while maintaining, in association with each file, a reference to that chunk for every repeated occurrence of the data. Another technique combines data deduplication with compression, e.g., by storing compressed chunks.
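By way of illustration only, the two techniques described above (storing each unique chunk once with per-file references, and compressing stored chunks) can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the fixed 4096-byte chunk size, the use of SHA-256 digests as chunk identifiers, the use of zlib for compression, and the names `ChunkStore`, `put`, and `get` are all hypothetical choices made for the example.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real systems may chunk differently


class ChunkStore:
    """Hypothetical store that keeps one compressed copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # chunk digest -> compressed chunk bytes
        self.files = {}   # file name -> ordered list of chunk digests (references)

    def put(self, name, data):
        """Split data into chunks, store unseen chunks, and record references."""
        refs = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only the first occurrence of a chunk is physically stored.
                self.chunks[digest] = zlib.compress(chunk)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        """Rebuild the original bytes by following the file's chunk references."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in self.files[name])


# Usage: two files that share the same two regions of data.
store = ChunkStore()
block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
store.put("first.bin", block_a + block_b + block_a)
store.put("second.bin", block_b + block_a)
```

In this example the two files together reference five chunks, but only two unique chunks are physically stored, and each file can still be reconstructed exactly from its list of references.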
There are many difficulties, tradeoffs, and choices with data deduplication. In some environments, there is too much data to deduplicate all of it in a single operation given the available time and resources, so consideration has to be given to which data to deduplicate and how to stage progressive deduplication over time. Moreover, not all data that can be deduplicated yields equal savings (benefits) from deduplication, and there is thus the potential for doing a lot of work for little value. Other aspects of data deduplication, including file selection, data security concerns, different types of chunking, different types of compression, and so forth, also need to be addressed in order to accomplish data deduplication in a way that provides desirable results.
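To make the staging concern concrete, the following is one possible sketch of how candidate files might be ordered and grouped into successive deduplication passes. It is an assumption made purely for illustration, not a method stated in this text: the greedy ranking by estimated savings per byte processed, the per-pass byte budget, and the names `Candidate` and `plan_passes` are all hypothetical, and the savings estimates are assumed to come from some separate, unspecified source.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    size_bytes: int         # cost proxy: bytes that must be read and chunked
    estimated_savings: int  # hypothetical estimate of bytes saved by deduplication


def plan_passes(candidates, budget_bytes_per_pass):
    """Greedily group candidates into passes, highest savings per byte first.

    Candidates with no expected savings are skipped. A candidate larger than
    the budget is still scheduled, in a pass of its own.
    """
    ranked = sorted(
        candidates,
        key=lambda c: c.estimated_savings / max(c.size_bytes, 1),
        reverse=True,
    )
    passes, current, used = [], [], 0
    for c in ranked:
        if c.estimated_savings <= 0:
            continue  # avoid doing work that yields no value
        if current and used + c.size_bytes > budget_bytes_per_pass:
            passes.append(current)  # budget reached: close this pass
            current, used = [], 0
        current.append(c.name)
        used += c.size_bytes
    if current:
        passes.append(current)
    return passes


# Usage with made-up numbers.
candidates = [
    Candidate("backups.vhd", 800, 600),
    Candidate("logs.txt", 200, 100),
    Candidate("media.mp4", 500, 0),
    Candidate("docs.zip", 400, 40),
]
plan = plan_passes(candidates, budget_bytes_per_pass=1000)
```

With these made-up inputs, the first pass covers the two files with the highest expected return, the low-yield archive is deferred to a second pass, and the file with no expected savings is never processed.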