The amount of data being stored and transmitted in modern data processing networks is growing rapidly as Web 2.0 technologies and content-rich media proliferate. Increasing employee mobility and rising capabilities of end-user systems (e.g., laptops, smartphones) also increase the demand for content storage and transmission, as do disaster recovery and enterprise globalization technologies, which frequently involve distribution of multiple copies of data over large geographical areas. At the same time, the cost and operational expense of maintaining network links and large pools of storage devices remain high.
A number of technologies have emerged to address the explosive demand for network bandwidth and storage capacity, including data reduction techniques such as caching, compression and de-duplication. Data de-duplication is of particular interest and involves dictionary-based reduction of extremely large volumes of data (e.g., terabytes or more) into smaller quantities of stored or transmitted data.
FIG. 1 illustrates a prior-art de-duplication engine 100 that produces a de-duplicated output data volume, Y, in response to an input data volume, X. Following the conventional approach, breakpoints are identified within the input data volume based on the data content itself, thereby dividing the input data volume into multiple content-defined segments. A hash index is computed for each segment and compared with the contents of a hash table. If a matching hash index is found within the table, a dictionary segment pointed to by the matching hash table entry is retrieved and compared byte for byte with the input data segment. If the dictionary segment and input data segment match, then a token associated with the dictionary segment is inserted into the output data volume in place of the input data segment, thus reducing the output volume relative to the input volume. If the segments do not match or no matching hash index is found, the input data segment may be added to the dictionary and the corresponding hash index added to the hash table to effect a dictionary update. A converse operation is performed at the transmission destination (or upon retrieval from mass storage media): the token is used to index a matching dictionary maintained at the destination, thereby restoring the original data segment within a recovered data volume.
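By way of illustration only, the conventional flow described above may be sketched as follows. The sketch is a simplified model under stated assumptions and is not the engine of FIG. 1: the names `split_segments`, `Deduplicator` and `decode`, the window-sum breakpoint criterion, the use of CRC-32 as the hash index, and the in-band transmission of unmatched segments as literals (so that the destination can mirror each dictionary update) are all hypothetical choices made for brevity.

```python
import zlib

WINDOW = 4     # bytes examined at each position (assumed value)
MASK = 0x0F    # breakpoint when the low 4 bits of the window sum are zero (assumed)

def split_segments(data):
    """Divide data into content-defined segments using a toy breakpoint test."""
    segments, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if sum(data[end - WINDOW:end]) & MASK == 0:
            segments.append(data[start:end])
            start = end
    if start < len(data):
        segments.append(data[start:])   # trailing bytes after the last breakpoint
    return segments

class Deduplicator:
    """Source-side engine: maps an input volume to tokens and literal segments."""

    def __init__(self):
        self.dictionary = []   # token -> dictionary segment
        self.hash_table = {}   # hash index -> token

    def encode(self, data):
        output = []
        for seg in split_segments(data):
            index = zlib.crc32(seg)              # hash index (CRC-32 is an assumption)
            token = self.hash_table.get(index)
            # A hash hit is confirmed by a byte-for-byte compare.
            if token is not None and self.dictionary[token] == seg:
                output.append(("token", token))
            else:
                # Miss: update the dictionary and hash table, emit the segment itself.
                self.dictionary.append(seg)
                self.hash_table[index] = len(self.dictionary) - 1
                output.append(("literal", seg))
        return output

def decode(output, dictionary):
    """Destination-side converse operation; `dictionary` mirrors the source's."""
    recovered = bytearray()
    for kind, value in output:
        if kind == "literal":
            dictionary.append(value)         # mirror the source's dictionary update
            recovered += value
        else:
            recovered += dictionary[value]   # index the dictionary using the token
    return bytes(recovered)
```

In this sketch a second call to `encode` on previously seen data yields tokens in place of the segments already held in the dictionary, and `decode` restores the original volume provided the destination dictionary has processed the same sequence of outputs.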
One substantial drawback of the foregoing de-duplication scheme is the intensive computation required to identify the breakpoints and compute the hash indices. In a typical implementation, a “fingerprint” is computed for each byte of the input data volume to determine whether the subject byte constitutes a breakpoint (e.g., whether the fingerprint meets some predetermined criterion, such as ‘0’s in some number of bit positions); the calculation generally involves a polynomial division over a range of data extending from the byte of interest. The hash index computation is similarly carried out for each byte of the input data volume and may similarly involve a compute-intensive calculation. The computing demand is particularly onerous in de-duplication systems that employ “strong” or near-perfect hashing functions (e.g., SHA-1, MD5 or the like) in an effort to avoid hash collisions. In general, the breakpoint identification and hash index computation are so demanding as to render the de-duplication operation impractical for high-bandwidth streaming data, thus requiring the data de-duplication operation to be executed offline for many important classes of applications.
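The per-byte nature of the breakpoint computation may be illustrated by the following sketch. As an assumption made for simplicity, a polynomial rolling hash over the integers modulo a prime is substituted for the polynomial division referred to above, and the window is taken to end at the byte of interest. The names `fingerprint` and `find_breakpoints` and the values of `WINDOW`, `BASE`, `MOD` and `MASK` are hypothetical.

```python
WINDOW = 16             # bytes covered by each fingerprint (assumed value)
BASE = 257              # polynomial base (assumed value)
MOD = (1 << 31) - 1     # prime modulus (assumed value)
MASK = (1 << 6) - 1     # breakpoint when the low 6 bits are all '0' (assumed)

def fingerprint(window):
    """Evaluate the window bytes as a polynomial in BASE, reduced modulo MOD."""
    fp = 0
    for b in window:
        fp = (fp * BASE + b) % MOD
    return fp

def find_breakpoints(data):
    """Return each offset `end` whose preceding WINDOW bytes meet the criterion."""
    if len(data) < WINDOW:
        return []
    top = pow(BASE, WINDOW - 1, MOD)     # weight of the oldest byte in the window
    fp = fingerprint(data[:WINDOW])
    breakpoints = []
    for end in range(WINDOW, len(data) + 1):
        if fp & MASK == 0:
            breakpoints.append(end)
        if end < len(data):
            # Slide the window one byte: drop data[end - WINDOW], append data[end].
            fp = ((fp - data[end - WINDOW] * top) * BASE + data[end]) % MOD
    return breakpoints
```

Even with the incremental update, the loop performs a multiply, a modular reduction and a comparison for every byte of the input data volume, which is the source of the computational burden noted above.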
The conventional approach is further plagued by dictionary “misses” that result from minor data modifications. Changing even a single byte within a segment will generally yield an entirely different hash index, particularly in applications that employ strong or near-perfect hashing, and thus produce a miss within the hash table (or worse, a hit within the hash table followed by a miss in the bytewise compare). Even more problematic is a modification within the region that produced a breakpoint in the original input data volume, as the resulting breakpoint loss will cause a dictionary miss for both of the segments previously delineated by the breakpoint (i.e., one segment ended by the breakpoint and another segment begun).
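The first of these failure modes, a hash table miss caused by a single-byte modification, may be illustrated as follows. The sketch assumes SHA-1 as the hash function (one of the examples named above) and uses a hypothetical helper, `hash_index`, together with arbitrary example segments. It does not model the breakpoint-loss case.

```python
import hashlib

def hash_index(segment):
    """Hash index of a segment (SHA-1 is assumed here as the strong hash)."""
    return hashlib.sha1(segment).hexdigest()

original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"   # one byte changed

# Hash table populated from the original segment only (token 0).
hash_table = {hash_index(original): 0}

print(hash_index(original))
print(hash_index(modified))
print(hash_index(modified) in hash_table)   # lookup for the modified segment
```

Although the two segments differ in only one byte, the lookup for the modified segment fails, so the entire segment is treated as new data.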