Information stored on computer systems often contains substantial redundancies. Storage, communication, and comparison of information can be made more efficient if the information can be segmented intelligently. For example, if a segment has been previously stored or transmitted, then a subsequent request to store or transmit the segment can be replaced with the storage or transmission of an indicator identifying the previously stored or transmitted segment. The indicator can then be used to reconstruct the original segment by referring to the previously stored or transmitted version.
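As an illustration of replacing a repeated segment with an indicator, the following is a minimal sketch rather than a description of any particular system. It assumes the indicator is a SHA-256 digest of the segment contents; the source does not specify how indicators are formed. The names `SegmentStore`, `store_stream`, and `reconstruct` are hypothetical.

```python
import hashlib


class SegmentStore:
    """Toy in-memory store that keeps one copy of each distinct segment."""

    def __init__(self):
        self._segments = {}  # indicator -> segment bytes

    def put(self, segment):
        # Assumption: the indicator is the SHA-256 digest of the segment.
        indicator = hashlib.sha256(segment).hexdigest()
        if indicator not in self._segments:
            # First time this content is seen: store the bytes themselves.
            self._segments[indicator] = segment
        # For a repeat, only the indicator needs to be stored or sent.
        return indicator

    def get(self, indicator):
        return self._segments[indicator]

    def distinct_count(self):
        return len(self._segments)


def store_stream(store, segments):
    """Store each segment and return the list of indicators in order."""
    return [store.put(segment) for segment in segments]


def reconstruct(store, indicators):
    """Rebuild the original data by looking up each indicator."""
    return b"".join(store.get(indicator) for indicator in indicators)
```

In this sketch, storing the segments `XYZ`, `ABCDE`, `RNN`, and `ABCDE` keeps only three distinct segments, and the second `ABCDE` is represented by the same indicator as the first.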
Selecting the boundary of the segment intelligently improves efficiency. For example, if a sequence of bytes appears identically in a number of different locations in the data set or stream (e.g., ‘XYZABCDERNNABCDE’ contains two occurrences of ‘ABCDE’) and that sequence of bytes (‘ABCDE’) is defined to be one of the segments, then the system could avoid storing the second occurrence of the segment and instead store a reference to the first copy. Note that if a segment boundary is defined differently each time, the system may or may not be able to recognize the identical run of bytes in two different segments. For example, a segment of ‘ABCDEX’ and a segment of ‘XABCDE’ may or may not be recognized by the system as containing the same sequence ‘ABCDE.’ Likewise, if a segment boundary divides ‘XXXABCDEXXXX’ into ‘XXXABC’ and ‘DEXXXX,’ then ‘ABCDE’ would not be found as a previously stored sequence. To achieve better storage or transmission efficiency, it is important to partition the data so that each run of identical bytes falls within a single segment, with boundaries placed consistently at each occurrence.
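The effect of boundary placement on the example string can be sketched as follows. This is only an illustration: the helper names `split_at` and `bytes_saved` are hypothetical, and the boundary positions are chosen by hand rather than by any selection method described here.

```python
def split_at(data, boundaries):
    """Split data into segments at the given byte offsets."""
    segments = []
    start = 0
    for end in list(boundaries) + [len(data)]:
        segments.append(data[start:end])
        start = end
    return segments


def bytes_saved(segments):
    """Count bytes that need not be stored because the segment repeats."""
    seen = set()
    saved = 0
    for segment in segments:
        if segment in seen:
            saved += len(segment)  # a reference can replace this segment
        else:
            seen.add(segment)
    return saved


data = b"XYZABCDERNNABCDE"

# Boundaries that isolate each occurrence of ABCDE as its own segment.
aligned = split_at(data, [3, 8, 11])

# Fixed-size boundaries every 6 bytes, which cut through both occurrences.
fixed = split_at(data, [6, 12])
```

With the aligned boundaries, the segments are `XYZ`, `ABCDE`, `RNN`, and `ABCDE`, so the five bytes of the second `ABCDE` are saved. With the fixed boundaries, the segments are `XYZABC`, `DERNNA`, and `BCDE`, none of which repeats, so nothing is saved.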
It is also important to be able to establish minimum and maximum limits on segment length. Simply locating anchors and setting boundaries at them can produce segments whose lengths fall outside those limits, so that separate evaluation and decision processes are required in order to satisfy the minimum and maximum segment length constraints.
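The problem can be seen in a small sketch that places a boundary at every anchor and then checks the resulting lengths. The anchor rule used here (a boundary after each `|` byte) is a toy assumption chosen for readability, and the names `segment_at_anchors` and `out_of_range` are hypothetical; practical systems typically derive anchors from the data content in other ways.

```python
def segment_at_anchors(data, is_anchor):
    """Place a boundary immediately after every position flagged as an anchor."""
    segments = []
    start = 0
    for i in range(len(data)):
        if is_anchor(data, i):
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing bytes after the last anchor
    return segments


def out_of_range(segments, min_len, max_len):
    """Return the segments whose lengths violate the limits."""
    return [s for s in segments if not (min_len <= len(s) <= max_len)]


def toy_anchor(data, i):
    # Toy assumption: a '|' byte marks an anchor.
    return data[i] == ord("|")


segments = segment_at_anchors(b"ab|c|defghijklmnop|qr|", toy_anchor)
violations = out_of_range(segments, min_len=3, max_len=8)
```

Here the anchors alone yield segments of lengths 3, 2, 14, and 3. With a minimum of 3 and a maximum of 8, two of the four segments violate the limits and would need a further pass to merge or split them.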
It would be beneficial to select boundaries that are likely to increase the amount of similar or identical data grouped within segments, thereby improving the efficiency of storage, communication, or comparison. It would also be beneficial to satisfy minimum and maximum segment length constraints without requiring separate evaluation and decision processes.