Data de-duplication involves finding similarities (e.g., copies) in data files and compressing or otherwise reducing the utilized memory based on those identified similarities. For example, instead of storing duplicate data at various locations in memory, a single copy of that data may be stored and the duplicated instances may be replaced with a reference (e.g., a pointer) to the single copy. It should be appreciated that the reference often requires significantly less memory to store than the duplicate copies of the data.
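By way of illustration only, the replacement of duplicate data with references may be sketched as follows. The sketch assumes the data has already been divided into blocks and uses a SHA-256 digest as the reference to the single stored copy. The function names `deduplicate` and `restore` are hypothetical and chosen for this example; they do not come from any particular de-duplication system.

```python
import hashlib


def deduplicate(blocks):
    """Store each distinct block once and return (store, refs).

    Assumption for this sketch: a block's SHA-256 digest serves as the
    reference (pointer) to its single stored copy.
    """
    store = {}  # digest -> the single stored copy of a block
    refs = []   # one 32-byte reference per input block, in input order
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # keep only the first copy seen
        refs.append(digest)
    return store, refs


def restore(store, refs):
    """Rebuild the original block sequence by following the references."""
    return [store[ref] for ref in refs]
```

In this sketch each reference occupies 32 bytes regardless of the block size, so a repeated 4096-byte block costs 32 bytes rather than 4096 bytes for each additional occurrence.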
The algorithms associated with identifying the duplicates themselves tend to be among the more computationally intensive aspects of data de-duplication and are often executed on general-purpose processors (e.g., the main processors) of computing devices. As such, the speed and quality of those algorithms are paramount. Some algorithms for identifying data duplication or, more specifically, for identifying natural boundaries of data “chunks” use a rolling hash or fingerprint, in which a hash of a fixed-size set of contiguous bytes (i.e., the bytes within the hash window) is computed on a rolling basis. A simple algorithm may involve computing the hash of the contiguous bytes from scratch at every location, whereas more sophisticated algorithms such as Rabin-Karp fingerprinting attempt to reduce computational overhead by incrementally adding the contribution of the next byte in the data buffer and subtracting the contribution of the byte that is no longer within the hash window (i.e., the “oldest” byte).
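The difference between the simple approach and the incremental approach may be sketched as follows, by way of example only. The sketch uses a polynomial rolling hash of the kind used in Rabin-Karp string matching, not a Rabin fingerprint computed over an irreducible polynomial. The constants `BASE` and `MOD`, the window size, the mask, and the function names are assumptions chosen for illustration.

```python
BASE = 257            # assumed multiplier for the polynomial hash
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)


def naive_hashes(data, window):
    """Recompute the hash of every window from scratch: O(n * window)."""
    if window <= 0 or len(data) < window:
        return []
    out = []
    for start in range(len(data) - window + 1):
        h = 0
        for byte in data[start:start + window]:
            h = (h * BASE + byte) % MOD
        out.append(h)
    return out


def rolling_hashes(data, window):
    """Update the hash incrementally as the window slides: O(n)."""
    if window <= 0 or len(data) < window:
        return []
    top = pow(BASE, window - 1, MOD)  # weight of the oldest byte in the window
    h = 0
    for byte in data[:window]:
        h = (h * BASE + byte) % MOD
    out = [h]
    for i in range(window, len(data)):
        oldest = data[i - window]
        h = (h - oldest * top) % MOD      # subtract the oldest byte's contribution
        h = (h * BASE + data[i]) % MOD    # add the next byte's contribution
        out.append(h)
    return out


def chunk_boundaries(data, window=16, mask=0x3F):
    """Mark a boundary after each window whose hash has all mask bits clear.

    The mask test is one common convention for content-defined chunking
    and is an assumption of this sketch.
    """
    return [start + window
            for start, h in enumerate(rolling_hashes(data, window))
            if h & mask == 0]
```

Both functions return the same hash values for the same input. The incremental version performs a constant amount of work per byte regardless of the window size, which is the source of the reduced computational overhead.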