As companies and individuals increasingly use and rely on computer systems and networks, the need for more efficient systems and faster networks grows. As a result, computer systems now have larger memories for storing information (e.g., data files and application programs), and computer networks have greater bandwidths for transmitting information. As the amount of information to be stored and transmitted continues to increase, the efficiency and speed of computer systems and networks can be further improved by storing, retrieving, and transmitting the information more efficiently and rapidly. Various systems and methods have been developed for this purpose. These systems and methods may utilize stateless chunking algorithms to achieve improved efficiency and speed.
Chunking algorithms break a long byte sequence S into a sequence of smaller blocks, or chunks, c1, c2, . . . , cn. This is preferably done in such a manner that the chunk sequence is stable under local modification of S. Stability under local modification means that if a small modification is made to S, resulting in S′, and the chunking algorithm is applied to S′, most of the chunks created for S′ are identical to the chunks created for S. The term “stateless” in the name of the algorithm means that, to perform its task, the algorithm relies only on the byte sequence S as input and is not allowed to look at other transient or state-dependent information that might be available. With unstable chunking algorithms, even a minor insertion or deletion in the middle of a sequence shifts all of the chunk boundaries that follow the modification point. The shifted chunks generally have different hash values and are therefore treated as new data, so a large amount of unchanged data is typically stored and/or transmitted again simply because it follows an insertion or deletion.
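By way of illustration only, the difference between unstable and stable chunking can be sketched in Python. The sketch below is not the method of the present disclosure; it assumes a generic content-defined scheme in which a boundary is placed wherever a hash of the trailing bytes is divisible by a constant. The names `fixed_chunks`, `content_chunks`, `changed`, and `sample_data`, and the parameters `WINDOW` and `DIVISOR`, are hypothetical and chosen for the example.

```python
import hashlib

WINDOW = 16    # bytes of trailing context inspected at each position (assumed)
DIVISOR = 64   # a cut is made roughly once every DIVISOR bytes (assumed)


def sample_data(n_blocks: int = 128) -> bytes:
    """Deterministic pseudo-random input, 32 bytes per block."""
    return b"".join(hashlib.sha256(str(i).encode()).digest()
                    for i in range(n_blocks))


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Unstable scheme: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_chunks(data: bytes) -> list:
    """Stateless, content-defined scheme: cut after position i when the
    hash of the WINDOW bytes ending at i is divisible by DIVISOR.
    Each decision depends only on bytes of the input sequence."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        digest = hashlib.sha1(data[i - WINDOW:i]).digest()
        if int.from_bytes(digest[:4], "big") % DIVISOR == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def changed(old_chunks: list, new_chunks: list) -> int:
    """Number of new chunks whose content does not appear among the old ones."""
    known = set(old_chunks)
    return sum(1 for c in new_chunks if c not in known)


if __name__ == "__main__":
    s = sample_data()
    s_mod = s[:2000] + b"X" + s[2000:]  # one-byte insertion mid-sequence
    print("fixed-size:", changed(fixed_chunks(s), fixed_chunks(s_mod)))
    print("content-defined:", changed(content_chunks(s), content_chunks(s_mod)))
```

With these assumed parameters, the one-byte insertion changes 34 of the 65 fixed-size chunks, because every chunk at or after the insertion point is shifted. In the content-defined variant, only windows that overlap the inserted byte can gain or lose a boundary, so at most WINDOW + 1 chunks can change, however long the sequence is.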
Chunking overhead is a measure of the amount of data that needs to be communicated and stored over and above the data that is actually contained in a modified sub-sequence. Reducing chunking overhead makes an apparatus that uses the chunking algorithm more efficient in its use of communication and storage resources.
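No formula for chunking overhead is given here, so the following Python sketch shows only one possible way to quantify it, under the assumption that every chunk not already known to the receiver must be sent in full. The function name `chunking_overhead` and its parameters are hypothetical.

```python
def chunking_overhead(old_chunks: list, new_chunks: list,
                      modified_bytes: int) -> int:
    """Bytes that must be resent beyond the bytes actually modified.

    Assumption: a chunk is resent in full unless an identical chunk
    already exists in old_chunks.
    """
    known = set(old_chunks)
    resent = sum(len(c) for c in new_chunks if c not in known)
    return resent - modified_bytes


old = [b"aaaa", b"bbbb", b"cccc"]

# Stable chunking: the one-byte insertion is confined to a single chunk.
stable = [b"aaaa", b"bbXbb", b"cccc"]

# Unstable chunking: 4-byte fixed chunks of the same modified sequence,
# so every boundary after the insertion shifts.
unstable = [b"aaaa", b"bbXb", b"bccc", b"c"]

print(chunking_overhead(old, stable, 1))    # 5 bytes resent, 1 modified -> 4
print(chunking_overhead(old, unstable, 1))  # 9 bytes resent, 1 modified -> 8
```

In this toy case the stable chunking resends one 5-byte chunk for a 1-byte change, whereas the unstable chunking resends 9 bytes for the same change.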
The need for chunking algorithms that are stable under local modification arises in at least two contexts: (1) archival file systems; and (2) low bandwidth network file systems. Unfortunately, previously known chunking methods, and apparatus comprising such methods, leave much to be desired in regard to both stability and efficiency. As such, the present disclosure is directed to methods and apparatus that can provide additional stability and/or efficiency, particularly when embodied in archival and low bandwidth network file systems.