The present invention relates generally to data deduplication and, in particular, to distinguishing deduplicatable parts of a dataset from non-deduplicatable parts of the dataset.
Data deduplication is a data compression technique that reduces bandwidth and storage requirements by eliminating duplicate copies of repeating data. In the deduplication process, data is analyzed, and unique chunks of data (i.e., “byte patterns”) are identified and stored. As the analysis continues, additional chunks of data are compared to the previously identified and stored chunks. Whenever a match occurs between two chunks of data, the redundant chunk is replaced with a reference that points to the stored chunk. In other words, only one instance of each chunk of data is actually stored, and any subsequent instances of that chunk are referenced back to the stored copy. Because the same byte pattern may occur thousands of times, data deduplication can substantially reduce the amount of data that must be transferred or stored.
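By way of illustration only, the general deduplication process described above may be sketched as follows. The sketch assumes fixed-size chunks and SHA-256 digests for comparing chunks, neither of which is specified in the foregoing description; the names `deduplicate` and `reconstruct` and the `chunk_size` parameter are hypothetical and chosen only for this example.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns (store, refs): store maps a chunk digest to the chunk bytes,
    and refs lists the digests in the order the chunks appear in the data.
    Fixed-size chunking and SHA-256 are assumptions made for this sketch.
    """
    store = {}  # digest -> the single stored copy of that chunk
    refs = []   # one reference per chunk of the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            # First occurrence of this byte pattern: store it.
            store[digest] = chunk
        # Every occurrence, new or duplicate, is recorded as a reference.
        refs.append(digest)
    return store, refs


def reconstruct(store, refs) -> bytes:
    """Rebuild the original data by following each reference in order."""
    return b"".join(store[digest] for digest in refs)


if __name__ == "__main__":
    data = b"ABCDABCDEFGHABCD"
    store, refs = deduplicate(data, chunk_size=4)
    print(len(refs), "chunks referenced,", len(store), "chunks stored")
    assert reconstruct(store, refs) == data
```

In this example the input contains four chunks, three of which share the same byte pattern, so only two chunks are stored while four references are kept. A practical system would typically also guard against digest collisions and might use variable-size chunking; those concerns are outside the scope of this sketch.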