Conventional data-dependent deduplication may employ a window-based parser to identify boundary locations. Conventional window-based parsers are configured as fixed-size window sub-block parsers. Data in the fixed-size window is evaluated to determine whether it satisfies a constraint. When the constraint is satisfied, a boundary is identified in the data being parsed. A small, fixed-size window may be efficient for identifying boundaries in certain types of data. For example, a small, fixed-size window can be efficient for processing data with very high entropy (e.g., random data). However, a small, fixed-size window may be inefficient for identifying boundaries in data with low entropy. Entropy is a measure of the uncertainty, or randomness, of the data in an object to be data reduced. On a normalized scale, the entropy of truly random data is one, and the entropy of a long string of duplicate characters is nearly zero. The entropy of most data falls between these two limiting examples.
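By way of illustration only, the normalized entropy scale described above can be approximated as follows. This is a minimal sketch, not a method taken from any conventional system: the function name normalized_entropy is hypothetical, and the estimate assumes a simple byte-frequency (order-0) Shannon entropy divided by 8 bits so that the result falls between zero and one.

```python
import math
from collections import Counter


def normalized_entropy(data: bytes) -> float:
    """Order-0 Shannon entropy of the byte histogram, scaled to [0, 1].

    Assumption: only byte frequencies are considered, not byte order,
    so this is a rough estimate of the entropy discussed in the text.
    """
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    bits_per_byte = sum((c / n) * math.log2(n / c) for c in counts.values())
    return bits_per_byte / 8.0  # 8 bits is the maximum for a byte alphabet
```

Under this estimate, a long run of one repeated character scores 0.0, and data in which all 256 byte values occur equally often scores 1.0. Because the estimate ignores ordering, a regularly repeating pattern that uses every byte value equally also scores 1.0, so the sketch is only a coarse stand-in for true randomness.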
The '810 patent (U.S. Pat. No. 5,990,810) describes one example of data-dependent deduplication that may employ a fixed-size window. Claim 1 of the '810 patent reads, in part:
    organizing a block b of digital data . . . by partitioning the block into subblocks at one or more positions k|k+1 in the block for which b[k−A+1 . . . k+B] satisfies a predetermined constraint,
    where A and B are natural numbers.
The notation b[k−A+1 . . . k+B] describes the “window” used by the parser. Data in the window is evaluated to determine whether a constraint is satisfied, which in turn determines whether a boundary is placed at position k|k+1. Either A or B may be zero. The '810 patent also describes a case where the constraint considers some of the data in a window b[k−A+1 . . . k+B] while ignoring the rest of the data in that window. By way of illustration, a constraint that only pays attention to, for example, b[k−3] and b[k+2], while ignoring the other characters in the window, would fall under the classes of constraint corresponding to A ≥ 4 and B ≥ 2, because the window must reach back at least to b[k−3] and forward at least to b[k+2].
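The window notation can be made concrete with a short sketch. The code below is illustrative only and is not taken from the '810 patent: the function names and the particular constraint (the sum of the two inspected bytes being divisible by 5) are assumptions chosen so the result is easy to check by hand. It uses A = 4 and B = 2, so the six-byte window spans b[k−3] through b[k+2], and the constraint inspects only those two end bytes.

```python
from typing import Callable, List


def find_boundaries(b: bytes, A: int, B: int,
                    constraint: Callable[[bytes], bool]) -> List[int]:
    """Return every k for which the window b[k-A+1 .. k+B] satisfies the
    constraint, i.e. a boundary is placed between b[k] and b[k+1]."""
    cuts = []
    # Only consider positions where the whole window lies inside the block.
    for k in range(A - 1, len(b) - B):
        window = b[k - A + 1 : k + B + 1]  # A + B bytes (0-based slicing)
        if constraint(window):
            cuts.append(k)
    return cuts


def sparse_constraint(window: bytes) -> bool:
    """Hypothetical constraint for A=4, B=2: inspect only b[k-3] and b[k+2]
    (the first and last bytes of the window) and ignore the rest."""
    return (window[0] + window[-1]) % 5 == 0


def split_at(b: bytes, cuts: List[int]) -> List[bytes]:
    """Partition b into sub-blocks, cutting after each position k."""
    blocks, start = [], 0
    for k in cuts:
        blocks.append(b[start : k + 1])
        start = k + 1
    if start < len(b):
        blocks.append(b[start:])
    return blocks
```

For the block bytes(range(32)), the inspected bytes at position k are k−3 and k+2, whose sum 2k−1 is divisible by 5 when k is 3, 8, 13, 18, 23, or 28, so find_boundaries returns exactly those positions and split_at yields sub-blocks of lengths 4, 5, 5, 5, 5, 5, and 3.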
In some conventional systems, when the data in the window is high-entropy data, the parser will yield a geometric distribution of sub-block sizes. A truncated geometric distribution of sub-block sizes may be desirable for certain data sets and for certain processing. However, some data sets (e.g., those with low entropy) may not parse with a geometric distribution of sub-block sizes. In some examples, when the entropy is low, the parser may not meet its constraint in a small window. When the parser does not meet its constraint, it may produce only maximum-length sub-blocks, which effectively degenerates the parser into a fixed-length parser.
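The degenerate behavior described above can be reproduced with a toy parser. The sketch below is an assumption-laden illustration rather than any conventional system's algorithm: the four-byte window, the 256-byte maximum sub-block length, and the constraint (the sum of the window bytes leaving remainder 63 when divided by 64) are all hypothetical values chosen so that roughly one window in 64 satisfies the constraint on random data. The window here ends at the candidate boundary, which corresponds to the case B = 0.

```python
from typing import List

WINDOW = 4     # hypothetical small fixed-size window, in bytes
MAX_LEN = 256  # hypothetical maximum sub-block length
DIVISOR = 64   # toy constraint fires for about 1 window in 64 on random data


def constraint(window: bytes) -> bool:
    """Toy constraint on the window contents (not a production hash)."""
    return sum(window) % DIVISOR == DIVISOR - 1


def parse(data: bytes) -> List[int]:
    """Return the sub-block lengths produced by the toy parser."""
    lengths, start = [], 0
    for end in range(1, len(data) + 1):
        size = end - start
        # Evaluate the window that ends at the candidate boundary.
        hit = size >= WINDOW and constraint(data[end - WINDOW : end])
        if hit or size == MAX_LEN:  # force a cut at the maximum length
            lengths.append(size)
            start = end
    if start < len(data):
        lengths.append(len(data) - start)  # trailing partial sub-block
    return lengths
```

On 4096 zero bytes the window sum is always zero, the constraint is never met, and parse returns sixteen sub-blocks of exactly 256 bytes, which is the fixed-length behavior described above. On random input the same parser would be expected to produce a spread of shorter lengths capped at 256.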
Smaller window sizes have been favored in some conventional systems because they provide some advantages. For example, fast boundary check algorithms are easier to generate for small window sizes. One reason is that such algorithms consider less data when placing a boundary than would be the case for a larger window. Another reason is the history involved in the rolling hash processing associated with evaluating a constraint. For example, a boundary checking algorithm may keep a history of the data currently in the window. Performance considerations may dictate that this history data be stored in one or more hardware registers. The hardware registers may be only 32 or 64 bits wide, and thus smaller window sizes may be preferred.
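The register constraint can be sketched as follows. This is a hypothetical illustration, not a description of any particular conventional parser: it assumes an eight-byte window whose entire history fits in one 64-bit register, emulates that register with a masked Python integer, and uses an arbitrary multiplicative mixing constant to stand in for the constraint evaluation. A naive version that re-reads the window at every position is included only to show that the rolling form computes the same boundaries.

```python
from typing import List

WINDOW = 8               # bytes of history; 8 bytes fill a 64-bit register
MASK64 = (1 << 64) - 1   # emulates the fixed width of a hardware register


def roll(history: int, byte: int) -> int:
    """Shift one new byte into the register; the oldest byte falls off."""
    return ((history << 8) | byte) & MASK64


def mix(history: int) -> int:
    """Hypothetical constraint input: top 6 bits of a multiplicative hash."""
    return ((history * 0x9E3779B97F4A7C15) & MASK64) >> 58


def cuts_rolling(data: bytes) -> List[int]:
    """Boundary positions found with one register update per input byte."""
    history, cuts = 0, []
    for i, byte in enumerate(data):
        history = roll(history, byte)
        if i >= WINDOW - 1 and mix(history) == 0:
            cuts.append(i)
    return cuts


def cuts_naive(data: bytes) -> List[int]:
    """Same boundaries, recomputed from the full window at every position."""
    return [i for i in range(WINDOW - 1, len(data))
            if mix(int.from_bytes(data[i - WINDOW + 1 : i + 1], "big")) == 0]
```

A window wider than the register would not fit in this single-integer history, which is one way to see why small windows may be preferred when the history is held in 32-bit or 64-bit registers.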