Data compression is a widely used technique for storage and network transfer. Compression is a class of data transformation techniques to represent information with fewer bits than its original form by exploiting statistical redundancy. Applying compression is a common technique to reduce the amount of data that needs to be stored or moved. There are numerous compression algorithms such as Lempel-Ziv (LZ), Huffman encoding, Deflate (a combination of LZ and Huffman), and many others. In addition, there are techniques that supplement compression by making data more effectively compressed: the Burrows-Wheeler Transform is a widely used example of such techniques. These compression techniques are lossless in the sense that decompression of compressed data will result in the exact same data as the beginning state.
When compressors run, they typically have a window over which they can find internal redundancy that may be in the tens of kilobytes up to megabytes. (Windows up to 1 GB or more are possible as well; the larger the redundancy window, the longer it takes to compress data.) This is due to internal structures that are optimized to find redundancy nearby as compared to distantly. As an example, a reference may only have 16 bits, so it can only reference data that is 2^16 bytes away. Generally, there is a tradeoff between computation and compressibility: often much of the available compression in a dataset can be achieved with a small amount of computation, and more extensive computation (and memory) can result in better data reduction. It is advantageous to find redundancy over larger intervals while minimizing the additional overhead in computation and memory.
There have been previous techniques, often called delta or similarity compression, that compress some data relative to other data. One way to apply delta compression is to divide incoming data into smaller units called data chunks, generate a resemblance hash over a portion of the data chunk, such as a super feature or pairwise intersection of one or more super features (referred to herein as a “sketch”), and check the sketch against an index of previously stored data chunks. A sketch can be generated by identifying “features” of a data chunk, most of which would likely not change even as small variations are introduced into the data chunk. In one example, a rolling hash function (e.g., a Rabin fingerprint) is applied over all overlapping small regions of the data chunk (e.g., a 32-byte window) and the features are selected from maximal hash values generated in the process. This can be done with multiple hash functions to generate any number of independent features (e.g., generating sixteen features). These features can be further processed to generate “super features,” which can be generated by the application of a hashing function over subsets of the feature values. For example, if sixteen features are grouped into four super features, a change to one of the sixteen features affects the super feature it is grouped into, but the three other super features are unchanged. While features are described herein as being used for similarity matching, one skilled in the art would understand that super features can also be used for this purpose and are considered herein as a subset of the use of features and feature matching.
Data chunks with identical or similar sketches are considered for delta compression, while sketches that are not found in the index cause the corresponding data chunk to be stored and the sketch added to the index. In this way only a sub-set of the data chunks need to be stored in their entirety. For each data chunk with a matched sketch the system encodes a data chunk as a reference to the similar data chunk (i.e., the data chunk with the matching sketch) and a difference or change in the data chunk being compressed relative to the similar data chunk. In this way, only the original data chunk and a difference (i.e., the delta) between the two similar data chunks are stored rather than two entire data chunks, thereby significantly reducing the amount of data that must be stored.
Each of the data chunks can also have a cryptographic hash executed over its content to generate a result that is referred to herein as a “fingerprint.” A fingerprint can be utilized to identify a specific data chunk. The data chunks can be portions of a file. The file has a recipe for reconstruction, which can consist of a list of fingerprints or storage locations and related information corresponding to the chunks stored in the data storage system.