Computer systems have become more efficient and computer networks faster. As a result, computer systems have larger memories for storing information such as data files and application programs, and computer networks have greater bandwidth for transmitting information. As the amount of information to be stored and transmitted continues to increase, the efficiency and speed of computer systems and networks can be further improved by storing, retrieving and transmitting the information more efficiently and rapidly. Various systems and methods have been developed for this purpose, and some of them use chunking algorithms to achieve improved efficiency and speed.
Chunking algorithms partition data composed of a sequence of bytes into nonoverlapping chunks. Landmark chunking algorithms determine the partitioning by using landmarks present in the data as chunk dividing points. A landmark is a point in the data identified by the local pattern of data around it. For example, a landmark might be defined as any point in a data stream immediately following a newline character. Landmark chunking a text file using this definition would partition the text file into a sequence of chunks, where each line of the text file is a separate chunk. Landmark definitions used in practice tend to be more complicated, so that file types other than text files are handled properly. For example, a point can be defined as a landmark if the immediately preceding 48 bytes have a Rabin fingerprint congruent to −1 modulo a prespecified number related to the average desired chunk size.
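The two landmark definitions described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names, the constants `WINDOW`, `DIVISOR`, `BASE` and `MODULUS`, and the use of a simple polynomial hash in place of a true Rabin fingerprint are all hypothetical choices made for the example.

```python
from typing import Callable, List

def landmark_chunks(data: bytes,
                    is_landmark: Callable[[bytes, int], bool]) -> List[bytes]:
    """Split data into nonoverlapping chunks, cutting at every landmark."""
    chunks = []
    start = 0
    for pos in range(1, len(data) + 1):
        # pos is a candidate dividing point between data[pos-1] and data[pos].
        if is_landmark(data, pos):
            chunks.append(data[start:pos])
            start = pos
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last landmark
    return chunks

def after_newline(data: bytes, pos: int) -> bool:
    """Landmark: any point immediately following a newline character."""
    return data[pos - 1] == 0x0A

# Assumed parameters for the second definition (hypothetical values).
WINDOW = 48            # number of preceding bytes examined
DIVISOR = 64           # related to the desired average chunk size
BASE = 257             # polynomial hash base (stand-in, not Rabin)
MODULUS = 1_000_000_007

def window_hash_landmark(data: bytes, pos: int) -> bool:
    """Landmark: hash of the preceding WINDOW bytes is -1 mod DIVISOR.

    The hash is a plain polynomial hash recomputed from scratch for clarity;
    a real Rabin fingerprint would be updated incrementally per byte.
    """
    if pos < WINDOW:
        return False
    h = 0
    for b in data[pos - WINDOW:pos]:
        h = (h * BASE + b) % MODULUS
    return h % DIVISOR == DIVISOR - 1  # -1 mod DIVISOR equals DIVISOR - 1

if __name__ == "__main__":
    text = b"first line\nsecond line\nthird"
    for chunk in landmark_chunks(text, after_newline):
        print(chunk)
```

Because both definitions look only at bytes near the candidate point, the decision to cut at a given point does not depend on how far that point is from the start of the data.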
Landmark chunking algorithms have many advantages. Perhaps the most useful is that local changes disturb only a small number of chunks. For example, when a text file is chunked at newline characters, adding a word to one line in the middle of the document disturbs only the chunk for that line, whereas simple division of the text file into fixed-size 80-character records causes every record after the added word to be different. Landmark chunking algorithms are thus especially suited for compacting related data by keeping only one copy of each chunk.
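The effect of a local change can be illustrated with a small Python sketch that compares newline landmark chunking against fixed-size division. The helper names, the sample text, and the record size of 8 bytes (chosen instead of 80 only to keep the example short) are assumptions made for illustration.

```python
from typing import List

def newline_chunks(data: bytes) -> List[bytes]:
    """Landmark chunking: cut immediately after every newline byte."""
    chunks, start = [], 0
    for i, b in enumerate(data):
        if b == 0x0A:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data: bytes, size: int) -> List[bytes]:
    """Fixed-size division into records of `size` bytes (last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunks_to_store(old: List[bytes], new: List[bytes]) -> List[bytes]:
    """Chunks of the new version not already held from the old version."""
    seen = set(old)
    return [c for c in new if c not in seen]

if __name__ == "__main__":
    original = b"alpha one\nbravo two\ncharlie three\ndelta four\n"
    # One word added to the second line.
    edited = b"alpha one\nbravo extra two\ncharlie three\ndelta four\n"

    by_landmark = chunks_to_store(newline_chunks(original),
                                  newline_chunks(edited))
    by_fixed = chunks_to_store(fixed_chunks(original, 8),
                               fixed_chunks(edited, 8))

    print("new chunks with landmark chunking:", len(by_landmark))
    print("new chunks with fixed-size records:", len(by_fixed))
```

Keeping only the chunks returned by `chunks_to_store` is one simple way to store a second, related version of the data while retaining a single copy of each chunk shared with the first version.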