As the amount of information to be stored and transmitted by computer systems or other electronic devices has dramatically increased, techniques have been developed to allow for more efficient data storage and processing. In some cases, chunking algorithms have been used to achieve improved efficiency and speed. Chunking algorithms partition one or more data objects into non-overlapping chunks. By dividing one or more data objects into chunks, a system is able to identify chunks that are shared by more than one data object or occur multiple times in the same data object, such that these shared chunks are stored just once to avoid or reduce the likelihood of storing duplicate data.
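By way of illustration only, the following minimal sketch shows how shared chunks may be stored just once. It assumes that chunks are identified by a SHA-256 digest and that the store is an in-memory dictionary; the names `store_chunks` and `fixed_size_chunks`, the 4-byte chunk size, and the sample data are hypothetical and are not taken from any particular system.

```python
import hashlib


def fixed_size_chunks(data, size):
    """Partition data into non-overlapping chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_chunks(chunks, store):
    """Store each distinct chunk once; return the keys needed to rebuild the object."""
    recipe = []
    for chunk in chunks:
        key = hashlib.sha256(chunk).hexdigest()  # assumed chunk identifier
        store.setdefault(key, chunk)             # a repeated chunk is not stored again
        recipe.append(key)
    return recipe


store = {}
object_a = b"AAAABBBBCCCC"
object_b = b"BBBBCCCCDDDD"  # shares the chunks BBBB and CCCC with object_a

recipe_a = store_chunks(fixed_size_chunks(object_a, 4), store)
recipe_b = store_chunks(fixed_size_chunks(object_b, 4), store)

# Each object is rebuilt by concatenating its chunks in recipe order.
rebuilt_a = b"".join(store[key] for key in recipe_a)
rebuilt_b = b"".join(store[key] for key in recipe_b)
```

In this sketch the two objects together comprise six chunks, of which only four distinct chunks are held in the store.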
One type of chunking algorithm is a landmark chunking algorithm, which performs partitioning of one or more data objects by first locating landmarks present in the one or more data objects. The landmarks are short predefined patterns of data whose locations are used in determining chunk boundaries. By convention, each landmark is considered to occur at a single position, often the position immediately following that landmark's pattern.
The landmark chunking algorithm then determines chunk boundaries from the landmark locations. The simplest landmark chunking algorithm places a chunk boundary at each landmark. More complicated landmark chunking algorithms take into account the distance between landmarks in order to, for example, avoid chunks that are too small or too large. Note that for such algorithms, not all landmarks will be designated chunk boundaries and not all chunk boundaries are located at landmark positions. In one example, a landmark may be considered to be located at any position in a data stream immediately following a newline character (the pattern). Landmark chunking a text file using the newline character as the landmark definition would partition the text file into a sequence of chunks, where each line of the text file may be a separate chunk. Landmark definitions that are actually used in practice tend to be more complicated to enable proper handling of file types other than text files. For example, a position in a data stream can be defined as a landmark location if the immediately preceding 48 bytes of data have a particular calculated value, such as a Rabin fingerprint equal to −1 mod a predefined number related to the average desired chunk size.
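The following sketch, offered under stated assumptions rather than as a definitive implementation, illustrates both landmark definitions and a boundary rule that bounds chunk size. All function names and the `min_size`, `max_size`, `window`, and `divisor` parameters are hypothetical. In particular, `window_hash_landmarks` uses a plain polynomial hash as a stand-in for a Rabin fingerprint and recomputes the hash for every window instead of rolling it, which is simple but slow.

```python
def newline_landmarks(data, pattern=b"\n"):
    """Landmark positions: the offset immediately following each pattern occurrence."""
    positions = []
    idx = data.find(pattern)
    while idx != -1:
        positions.append(idx + len(pattern))
        idx = data.find(pattern, idx + 1)
    return positions


def window_hash_landmarks(data, window=48, divisor=64):
    """Positions whose preceding `window` bytes hash to -1 mod `divisor`.

    Assumption: a simple polynomial hash stands in for a Rabin fingerprint.
    """
    positions = []
    for pos in range(window, len(data) + 1):
        h = 0
        for byte in data[pos - window:pos]:
            h = (h * 257 + byte) % 1_000_000_007
        if h % divisor == divisor - 1:
            positions.append(pos)
    return positions


def chunk_boundaries(length, landmarks, min_size=1, max_size=None):
    """Choose chunk end offsets from landmark positions.

    A landmark closer than min_size to the previous boundary is skipped, and
    a boundary is forced every max_size bytes when no landmark is usable.
    The end of the data always closes the final chunk.
    """
    boundaries = []
    last = 0
    for pos in list(landmarks) + [length]:
        while max_size is not None and pos - last > max_size:
            last += max_size
            boundaries.append(last)  # forced boundary, not at a landmark
        if pos > last and (pos - last >= min_size or pos == length):
            boundaries.append(pos)
            last = pos
    return boundaries


def split_at(data, boundaries):
    """Cut data into chunks ending at each boundary offset."""
    starts = [0] + boundaries[:-1]
    return [data[s:e] for s, e in zip(starts, boundaries)]


text = b"ab\nc\nlonger line here\nend"
landmarks = newline_landmarks(text)

# Simplest rule: a chunk boundary at every landmark.
simple = split_at(text, chunk_boundaries(len(text), landmarks))

# Size-bounded rule: chunks of at least 4 and at most 8 bytes where possible.
bounded = split_at(text, chunk_boundaries(len(text), landmarks, min_size=4, max_size=8))
```

With the simplest rule, `simple` contains one chunk per line. With the size-bounded rule, the landmark after `ab\n` is skipped because the resulting chunk would be too short, and the long third line is cut at forced boundaries that are not landmark positions.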
A benefit of landmark chunking algorithms is that local changes are likely to disturb only a small number of chunks. For example, in a text file, adding a word to one line in the middle of the document only disturbs that line (chunk). In contrast, if a text file were to be simply divided into fixed-size 80-character records, an added word would shift all of the text that follows it, causing every record after the added word to change and increasing the amount of computer processing required.
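The contrast can be illustrated with the following sketch, which counts how many chunks of an edited file do not already appear among the chunks of the original file. The synthetic 200-line document, the inserted word, and the helper names `fixed_records`, `line_chunks`, and `changed_chunks` are assumptions made for the example only.

```python
def fixed_records(data, size=80):
    """Divide data into fixed-size records (the last record may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def line_chunks(data):
    """Landmark chunking with a boundary immediately after each newline."""
    return data.splitlines(keepends=True)


def changed_chunks(old_chunks, new_chunks):
    """Count the chunks of the new version that are absent from the old version."""
    old = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk not in old)


# Synthetic text file: 200 lines of 25 bytes each.
original = b"".join(b"line %03d of the document\n" % i for i in range(200))

# Add one word to a line in the middle of the document.
edited = original.replace(b"line 100 of", b"line 100 extra of")

changed_lines = changed_chunks(line_chunks(original), line_chunks(edited))
changed_records = changed_chunks(fixed_records(original), fixed_records(edited))

print("line chunks changed:", changed_lines)
print("fixed-size records changed:", changed_records)
```

Only the edited line differs under newline landmark chunking, whereas the record containing the added word and every fixed-size record after it differ from the originally stored records.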
Conventional landmark chunking algorithms that are applied to large input data can be computationally intensive. For example, in the data backup or archiving context, relatively large amounts of data are processed during the backup or archiving operation. If the landmark chunking algorithm is not performed efficiently, then the backup or archiving operation may take a long time to complete.