Many companies and individuals with large amounts of stored data employ a backup data storage system. These backup data storage systems can be located locally to the data to be backed up or at a remote site. The backup data storage systems can be managed by the entity controlling the primary data storage devices or by a backup data storage service company. Data can be backed up at any frequency and any amount of data can be backed up. In the case of a failure of a primary storage device, the backed up data can be retrieved from the backup data storage system. Where the backup is managed by a service company, the data of many separate clients of the backup service can be backed up into the backup data storage system.
Compression of data before storage is widespread within the backup storage service market. Delta compression is a particular type of compression that is applied in this field. The basic idea of delta compression is to divide incoming data into smaller units called data chunks, generate a resemblance hash over the data chunk (this hash result is referred to herein as a “sketch”), and check the sketch against an index of previously stored data chunks. A sketch can be generated by identifying “features” of a data chunk that would not likely change even as small variations are introduced into the data chunk. In one example embodiment, a rolling hash function (e.g., to generate a Rabin fingerprint) is applied over all overlapping small regions of the data chunk (e.g., a 32 byte window) and the features are selected from maximal hash values generated in the process. This can be done with multiple hash functions to generate any number of features (e.g., generating four features). These features can be further processed to generate “super features,” which can be generated by the application of a hashing function over the feature values. While features are described herein as being used for similarity matching, one skilled in the art would understand that super features can also be used for this purpose and are considered herein as a subset of the use of features and feature matching.
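The sketch generation described above can be illustrated with a minimal Python sketch. It is illustrative only and rests on several assumptions that are not taken from the description: a simple polynomial rolling hash stands in for a Rabin fingerprint, the four features are derived by applying four assumed linear transforms to each window hash and keeping the maximal value per transform, and SHA-256 is used as the hashing function for the super feature. The names rolling_hashes, features, and super_feature, and all constants other than the 32 byte window, are hypothetical.

```python
import hashlib
import struct

WINDOW = 32            # window size in bytes, as in the example embodiment
BASE = 257             # assumed polynomial base (stand-in for a Rabin fingerprint)
MOD = (1 << 61) - 1    # assumed modulus
MASK = (1 << 32) - 1   # feature values are kept to 32 bits

# Assumed (multiplier, addend) pairs; each pair acts as one hash function.
TRANSFORMS = [
    (0x9E3779B1, 0x7F4A7C15),
    (0x85EBCA6B, 0xC2B2AE35),
    (0x27D4EB2F, 0x165667B1),
    (0xD6E8FEB9, 0x1B873593),
]


def rolling_hashes(chunk):
    """Yield a hash for every overlapping WINDOW-byte region of the chunk."""
    if len(chunk) < WINDOW:
        return
    h = 0
    for b in chunk[:WINDOW]:
        h = (h * BASE + b) % MOD
    yield h
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i in range(WINDOW, len(chunk)):
        h = ((h - chunk[i - WINDOW] * top) * BASE + chunk[i]) % MOD
        yield h


def features(chunk):
    """Return one feature per transform: the maximal transformed window hash."""
    best = [0] * len(TRANSFORMS)
    for h in rolling_hashes(chunk):
        for k, (m, a) in enumerate(TRANSFORMS):
            v = (m * h + a) & MASK
            if v > best[k]:
                best[k] = v
    return tuple(best)


def super_feature(feats):
    """Hash the feature values together into a single super feature."""
    packed = struct.pack(f">{len(feats)}I", *feats)
    return hashlib.sha256(packed).hexdigest()[:16]
```

Because each feature depends only on the window that produces the maximal value, an edit that leaves that window untouched and does not create a larger value leaves the feature unchanged, which is why two slightly different data chunks tend to share features.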
Data chunks with identical or similar sketches are considered for delta compression, while a sketch that is not found in the index causes the corresponding data chunk to be stored in its entirety and the sketch to be added to the index. In this way only a subset of the data chunks need to be stored in their entirety. For each data chunk with a matched sketch, the system encodes that data chunk as a reference to the similar data chunk (i.e., the previously stored data chunk with the matching sketch) and a difference or change in the data chunk being compressed relative to the similar data chunk. In this way, only the original data chunk and a difference (i.e., the delta) between the two similar data chunks are stored rather than two entire data chunks, thereby significantly reducing the amount of data that must be stored.
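The store-or-encode decision described above can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions, not the encoding used by any particular system: the delta is a list of copy and insert operations derived from the standard library difflib.SequenceMatcher, only exact sketch matches are handled, and the sketch is supplied by the caller. The names encode_delta, apply_delta, and DeltaStore are hypothetical.

```python
from difflib import SequenceMatcher


def encode_delta(base, target):
    """Encode target as copy/insert operations relative to base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # new bytes stored literally
        # "delete": bytes of base that are simply not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the target chunk from base and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


class DeltaStore:
    """Assumed in-memory store: a sketch index plus a list of stored entries."""

    def __init__(self):
        self.sketch_index = {}  # sketch -> id of a chunk stored in its entirety
        self.chunks = []        # id -> ("full", data) or ("delta", base_id, ops)

    def put(self, chunk, sketch):
        base_id = self.sketch_index.get(sketch)
        if base_id is None:
            # Sketch not found: store the whole chunk and index its sketch.
            self.chunks.append(("full", chunk))
            self.sketch_index[sketch] = len(self.chunks) - 1
        else:
            # Sketch matched: store a reference to the base plus the delta.
            base = self.chunks[base_id][1]
            self.chunks.append(("delta", base_id, encode_delta(base, chunk)))
        return len(self.chunks) - 1

    def get(self, chunk_id):
        entry = self.chunks[chunk_id]
        if entry[0] == "full":
            return entry[1]
        _, base_id, ops = entry
        return apply_delta(self.chunks[base_id][1], ops)
```

In this sketch only data chunks stored in their entirety are added to the index, so a delta never refers to another delta and reconstruction requires a single lookup.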
Each of the data chunks also has a secure hash executed over it to generate a result that is referred to herein as a “fingerprint.” The fingerprint can be utilized to identify a specific data chunk. The data chunks can be portions of a file. The file has a recipe for reconstruction, which consists of a list of fingerprints and related information corresponding to the chunks stored in the backup data storage system. For backup data storage systems, the typical backup cycle consists of daily or weekly full backups of a primary data storage system, in which most of the data stored in the backup data storage system is repeated (i.e., a duplicate) or has minor alterations. These minor alterations can be effectively delta compressed to avoid having to store significant amounts of data.
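The relationship between fingerprints and a file recipe can be illustrated with the following minimal Python sketch. The choice of SHA-256 as the secure hash, the fixed 4096 byte chunk size, and the use of a dictionary as the chunk store are assumptions made for illustration; the names fingerprint, backup, and restore are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def fingerprint(chunk):
    """Secure hash over a data chunk (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


def backup(data, chunk_store, chunk_size=CHUNK_SIZE):
    """Store each chunk once, keyed by fingerprint, and return the file recipe."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = fingerprint(chunk)
        chunk_store.setdefault(fp, chunk)  # a duplicate chunk is not stored again
        recipe.append(fp)
    return recipe


def restore(recipe, chunk_store):
    """Reconstruct a file by looking up each fingerprint listed in its recipe."""
    return b"".join(chunk_store[fp] for fp in recipe)
```

When a second full backup differs from the first in a single chunk, the second recipe repeats the fingerprints of the unchanged chunks and only the altered chunk is added to the store.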
Delta compression also has applicability to data transmission where, instead of minimizing the amount of data storage space that is required, the delta compression seeks to minimize the amount of data that must be transmitted, thereby reducing bandwidth requirements. The delta compression transmission process works in a similar manner to that of the delta compression for storage, except that instead of storing the delta compressed data, the delta compressed data is transmitted to a receiving computer. Fingerprint data is also transmitted to the receiving machine such that a fingerprint index can be maintained at the receiving machine to enable proper reconstruction of the transmitted data at the receiving machine.
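The transmission case can be illustrated with the following Python sketch, offered under stated assumptions rather than as a description of any particular protocol. The message layout, the sender's set of fingerprints believed to be held by the receiver, and the deliberately simple common-prefix and common-suffix delta are all assumptions; selection of the similar base chunk (e.g., by sketch matching) is outside the sketch and the base is passed in by the caller. The names make_delta, apply_delta, send_chunk, and receive are hypothetical.

```python
import hashlib


def fingerprint(chunk):
    """Secure hash identifying a chunk (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).digest()


def make_delta(base, target):
    """Simplified delta: shared prefix length, shared suffix length, new middle."""
    limit = min(len(base), len(target))
    p = 0
    while p < limit and base[p] == target[p]:
        p += 1
    s = 0
    while s < limit - p and base[-1 - s] == target[-1 - s]:
        s += 1
    return p, s, target[p:len(target) - s]


def apply_delta(base, delta):
    p, s, middle = delta
    return base[:p] + middle + base[len(base) - s:]


def send_chunk(chunk, base, receiver_fps):
    """Build the message for one chunk; receiver_fps is the sender's view of
    which fingerprints the receiving machine already holds."""
    fp = fingerprint(chunk)
    if base is not None and fingerprint(base) in receiver_fps:
        msg = ("delta", fp, fingerprint(base), make_delta(base, chunk))
    else:
        msg = ("full", fp, chunk)
    receiver_fps.add(fp)
    return msg


def receive(msg, index):
    """Reconstruct the chunk and record it in the receiver's fingerprint index."""
    if msg[0] == "full":
        _, fp, chunk = msg
    else:
        _, fp, base_fp, delta = msg
        chunk = apply_delta(index[base_fp], delta)
    if fingerprint(chunk) != fp:
        raise ValueError("reconstructed chunk does not match its fingerprint")
    index[fp] = chunk
    return chunk
```

The transmitted fingerprint serves two purposes in this sketch: it names the base chunk that the receiving machine must look up, and it allows the receiving machine to verify the reconstructed chunk before adding it to its index.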