Information stored on computer systems often contains substantial redundancies. Storage, communication, and comparison of information can be made more efficient if the information can be segmented intelligently. For example, if a segment has been previously stored or transmitted, then a subsequent request to store or transmit the segment can be replaced with the storage or transmission of an indicator identifying the previously stored or transmitted segment. The indicator can then be used to reconstruct the original segment by referring to the previously stored or transmitted version.
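As a minimal sketch of replacing a repeated segment with an indicator, the following assumes the indicator is a SHA-256 digest of the segment contents. The class `SegmentStore` and the helpers `store_stream` and `reconstruct` are hypothetical names introduced only for illustration; the text above does not prescribe any particular indicator or data structure.

```python
import hashlib


class SegmentStore:
    """Keeps one copy of each distinct segment, keyed by a content digest."""

    def __init__(self):
        self._segments = {}

    def put(self, segment: bytes) -> str:
        # Assumption: a SHA-256 digest serves as the indicator for a segment.
        key = hashlib.sha256(segment).hexdigest()
        # Only the first occurrence is actually stored.
        self._segments.setdefault(key, segment)
        return key

    def get(self, key: str) -> bytes:
        return self._segments[key]

    def __len__(self) -> int:
        return len(self._segments)


def store_stream(store: SegmentStore, segments):
    """Store each segment and return the list of indicators."""
    return [store.put(s) for s in segments]


def reconstruct(store: SegmentStore, indicators) -> bytes:
    """Rebuild the original data by looking up each indicator."""
    return b"".join(store.get(k) for k in indicators)
```

Storing the segments `b"ABCDE"`, `b"XYZ"`, `b"ABCDE"` in this sketch keeps only two segments, and the first and third indicators are identical.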
Selecting segment boundaries intelligently improves efficiency. For example, if a sequence of bytes appears identically in a number of different locations in the data set or stream (e.g., ‘XYZABCDERNNABCDE’ contains two occurrences of ‘ABCDE’) and that sequence of bytes (‘ABCDE’) is defined to be one of the segments, then the system could avoid storing the second occurrence of the segment and instead store a reference to the first copy. Note that if segment boundaries are defined differently each time, the system may or may not be able to recognize an identical run of bytes in two different segments. For example, a segment of ‘ABCDEX’ and a segment of ‘XABCDE’ may or may not be recognized by the system as sharing the sequence ‘ABCDE.’ If a segment boundary divides ‘XXXABCDEXXXX’ into ‘XXXABC’ and ‘DEXXXX,’ then ‘ABCDE’ would not be found as a previously stored sequence. It is therefore important to partition the data so that runs of identical bytes are grouped together in the same segment in order to achieve better storage or transmission efficiency.
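The effect of boundary placement can be illustrated with the example string from the paragraph above. This is a sketch only: `split_at` and `unique_bytes` are hypothetical helpers, and the boundary positions are chosen by hand to contrast an aligned partition with a misaligned one.

```python
def split_at(data: bytes, boundaries):
    """Split data into segments at the given byte offsets."""
    cuts = [0, *boundaries, len(data)]
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]


def unique_bytes(segments) -> int:
    """Bytes that must be stored if each distinct segment is kept once."""
    return sum(len(s) for s in set(segments))


data = b"XYZABCDERNNABCDE"

# Boundaries that isolate both occurrences of ABCDE as whole segments.
aligned = split_at(data, [3, 8, 11])

# Boundaries that cut through both occurrences of ABCDE.
misaligned = split_at(data, [6, 12])
```

With the aligned boundaries the two ‘ABCDE’ segments are identical, so only 11 of the 16 bytes need to be stored. With the misaligned boundaries no segment repeats and all 16 bytes are stored.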
In some cases, a block has been partitioned by setting a series of boundaries located within areas of the block that are determined to be similar, or identical, to each other. Similarity, or identity, of these areas can be determined by comparing the areas within the block and checking whether they satisfy a predetermined criterion: for example, whether the hash of the data values between positions k−A+1 and k+B in block b, where A and B are natural numbers, satisfies a predetermined constraint (e.g., the bottom 12 bits of the hash are all 0's). A boundary is then set within the area of the block that is determined to be similar, or identical, for example somewhere between k−A+1 and k+B within the block.
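The hash test described above can be sketched as follows. This is an illustration under stated assumptions rather than the method of any particular system: `find_anchors` is a hypothetical function, CRC-32 from Python's `zlib` module stands in for the unspecified hash, positions are 0-indexed, and the window is recomputed at every position instead of using a rolling hash.

```python
import zlib


def find_anchors(data: bytes, a: int, b: int, mask_bits: int = 12):
    """Return every position k whose window data[k-a+1 .. k+b] hashes to a
    value with its bottom mask_bits bits all zero."""
    mask = (1 << mask_bits) - 1
    anchors = []
    # k must leave room for a-1 bytes before it and b bytes after it.
    for k in range(a - 1, len(data) - b):
        window = data[k - a + 1 : k + b + 1]  # a + b bytes, inclusive of k+b
        # Assumption: CRC-32 is used here only as a convenient stand-in hash.
        if zlib.crc32(window) & mask == 0:
            anchors.append(k)
    return anchors
```

Because the decision at each position depends only on the bytes in the window, identical windows at different offsets produce the same decision. With `mask_bits=12`, roughly one position in 4096 would be expected to qualify on random data.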
However, other data values surrounding the similar or identical areas of data are often also similar or identical. Placing the boundary in the middle of such an area therefore breaks the data inefficiently, because the similar or identical data values in the block do not end up in the same segment.
Also, it is important to be able to establish minimum and maximum limits for segment lengths. Simply locating anchors and setting boundaries can produce segments whose lengths fall outside those limits, thus requiring separate evaluation and decision processes in order to satisfy the minimum and maximum segment length constraints.
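To make the extra work concrete, the following is a hypothetical sketch of the kind of separate pass that would be needed after anchors are located. The function `enforce_limits` and its policy (skip an anchor that would close a too-short segment, force a cut every `max_len` bytes when no usable anchor arrives) are assumptions made for illustration; the text above does not specify how such a pass would work.

```python
def enforce_limits(anchors, length: int, min_len: int, max_len: int):
    """Turn raw anchor offsets into boundaries that respect length limits.

    anchors are byte offsets strictly between 0 and length.
    """
    boundaries = []
    start = 0  # offset where the current segment begins
    for cut in sorted(anchors) + [length]:
        # No usable anchor within max_len bytes: force cuts.
        while cut - start > max_len:
            start += max_len
            boundaries.append(start)
        # Accept an anchor only if the segment it closes is long enough.
        if cut < length and cut - start >= min_len:
            boundaries.append(cut)
            start = cut
    return boundaries
```

For example, `enforce_limits([2, 5, 30], 40, 4, 10)` drops the anchor at 2, keeps the anchors at 5 and 30, and inserts forced cuts at 15 and 25. In this sketch the final segment can still be shorter than `min_len`, which is one of the decisions such a pass would have to handle.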
It would be beneficial to increase the amount of similar or identical data kept together within segments, to improve the efficiency of storage, communication, or comparison. It would also be beneficial to satisfy minimum and maximum segment length constraints without separate evaluation and decision processes.