The primary value of deduplication lies in the reduction of required storage space, which translates into improved reliability and significant operational savings through reduced power consumption and lower facility costs. However, for these savings to materialize, we must consider the total storage space needed to keep not only the backup data, but also all types of metadata (e.g. backup application metadata, intermediate-level metadata such as filesystems, and the back-end metadata proper used for location and deduplication). Accounting for metadata is not only necessary to estimate the savings from deduplication; it often changes the relative utility of the various deduplication alternatives in a particular storage system.
Today, the standard technique for dividing backup data into chunks for deduplication is content-defined chunking (CDC) (NPL 16, 20). It produces variable-length chunks using Rabin's fingerprints (NPL 22) to select chunk boundaries in the input stream. CDC has many advantages. It does not maintain any state, and it is effective in the presence of data insertions and deletions in the original stream: because cut points depend only on local content, unmodified cut points are rediscovered on the next chunker run, so unchanged chunks are identified for deduplication. Moreover, CDC does not need to know backup stream boundaries or their sequencing to work well. This fits nicely with the reality of commercial setups, in which a standard backup application usually assumes a “dumb” storage back-end, making it impossible to communicate such information to the back-end.
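The boundary-selection mechanism described above can be sketched as follows. This is a minimal illustrative implementation, not the one from NPL 16, 20, or 22: a simple polynomial rolling hash stands in for Rabin's fingerprints, and the window size, boundary mask, and minimum/maximum chunk sizes are arbitrary choices for the sketch. The key property it demonstrates is that a cut point depends only on the last few bytes of content, so cut points downstream of an insertion realign with the original stream.

```python
# Content-defined chunking sketch. Parameters are illustrative assumptions.
WINDOW = 16            # rolling-hash window, in bytes
MASK = (1 << 11) - 1   # cut when (hash & MASK) == MASK -> ~2 KiB average chunk
MIN_SIZE = 256         # suppress pathologically small chunks
MAX_SIZE = 8192        # force a cut to bound chunk size
PRIME = 1000003        # multiplier for the polynomial rolling hash
MOD = 1 << 32
POW_W = pow(PRIME, WINDOW, MOD)  # factor of the byte leaving the window

def chunk(data: bytes) -> list:
    """Split data into variable-length chunks at content-defined boundaries."""
    chunks = []
    start = 0  # start offset of the current chunk
    h = 0      # rolling hash over the last WINDOW bytes of the stream
    for i, b in enumerate(data):
        h = (h * PRIME + b) % MOD
        if i >= WINDOW:
            # remove the contribution of the byte sliding out of the window
            h = (h - data[i - WINDOW] * POW_W) % MOD
        size = i - start + 1
        # cut on a hash match (content-defined) or on the hard size limit
        if size >= MAX_SIZE or (size >= MIN_SIZE and (h & MASK) == MASK):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be short
    return chunks
```

Note that the rolling hash is not reset at chunk boundaries: candidate cut points are a pure function of the surrounding bytes, which is exactly what lets the chunker resynchronize after an edit and re-identify the unchanged chunks.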