Data deduplication is a data compression technology that reduces data volume by identifying and eliminating redundant data. The first deduplication technologies for “single-instance storage” were based on file-grain deduplication. In file-grain deduplication, duplicate copies of the same file, regardless of file name, were detected based on content, and only one copy of the file was stored on disk. However, file-grain technologies suffered from inefficiencies related to comparing data at such a coarse level: for example, if a single word was added to a document, single-instance storage would not recognize any data redundancies and would store the entire file as a “new” file.
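As a rough illustration of file-grain deduplication, the following Python sketch keys each file by a hash of its content, so that identical content is stored once regardless of file name. The function name `single_instance_store`, the choice of SHA-256, and the sample file names are assumptions made for this example and are not drawn from any particular product.

```python
import hashlib

def single_instance_store(files):
    """Hypothetical helper: files maps a file name to its content (bytes)."""
    store = {}    # content digest -> content, one copy per distinct content
    catalog = {}  # file name -> content digest
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy
        catalog[name] = digest
    return store, catalog

files = {
    "report.txt": b"quarterly results are up",
    "copy_of_report.txt": b"quarterly results are up",
    "report_v2.txt": b"quarterly results are up sharply",
}
store, catalog = single_instance_store(files)
print(len(files), "files,", len(store), "stored copies")
```

The two identically named-apart copies share one stored entry, while the version with a single added word is stored in full, which is the coarse-grain inefficiency described above.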
One method for reducing the amount of network bandwidth and computer storage required when replicating data is to divide individual files into blocks, identify duplicate data blocks, and transfer and store only a single instance of each duplicated data block. One method for dividing files (or other data sources such as complete filesystems) into blocks involves using fixed-size blocks. Dividing a file into smaller blocks provides the potential benefit of generating more duplicate blocks, thus making the replicated data smaller once the duplication is removed and accounted for. On the other hand, dividing files into larger blocks requires less indexing and fewer block comparisons, which makes the block index smaller and reduces the associated processing and memory overhead for the replicated data.
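The block-size trade-off can be made concrete with a small Python sketch. The helper `dedup_stats` and the eight-byte sample input are assumptions chosen only to make the numbers easy to follow; real block sizes are typically kilobytes, not bytes.

```python
import hashlib

def dedup_stats(data, block_size):
    """Return (total_blocks, index_entries, stored_bytes) for one block size."""
    index = {}  # block digest -> block length, one entry per unique block
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        index.setdefault(hashlib.sha256(block).digest(), len(block))
        total += 1
    return total, len(index), sum(index.values())

data = b"AABBAACC"
small = dedup_stats(data, 2)  # blocks: AA, BB, AA, CC
large = dedup_stats(data, 4)  # blocks: AABB, AACC
print("block size 2:", small)
print("block size 4:", large)
```

With two-byte blocks the repeated `AA` is detected and only six bytes are stored, but the index holds three entries. With four-byte blocks the index shrinks to two entries, yet no duplicate is found and all eight bytes are stored.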
The early technologies based on file-grain deduplication have largely disappeared in favor of block-grain deduplication, in which a file is treated as a collection of multiple data blocks. Block-grain deduplication operates within files and stores only unique or newly created blocks. In operation, each block of a file is compared to known blocks. If a block has been previously stored, the block is simply referenced rather than stored again. Each block, stored only once, is compressed using other encoding technologies. Accordingly, block-grain deduplication delivers higher compression ratios than single-instance storage.
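A minimal Python sketch of this store-or-reference behavior follows. The class name `BlockStore`, the four-byte block size, SHA-256 as the block identifier, and zlib as a stand-in for the "other encoding technologies" are all assumptions made for illustration.

```python
import hashlib
import zlib

class BlockStore:
    """Hypothetical block-grain store: each unique block is kept once."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # block digest -> compressed block

    def put(self, data):
        """Store a file and return its list of block references."""
        refs = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in self.blocks:
                # New block: compress it and store it once.
                self.blocks[digest] = zlib.compress(block)
            # Known or new, the file only records a reference.
            refs.append(digest)
        return refs

    def get(self, refs):
        """Rebuild a file from its block references."""
        return b"".join(zlib.decompress(self.blocks[r]) for r in refs)

store = BlockStore(block_size=4)
first = store.put(b"AAAABBBBCCCC")
second = store.put(b"AAAAXXXXCCCC")
print(len(store.blocks), "unique blocks stored for 6 blocks written")
```

The second file differs from the first only in its middle block, so it adds one new block to the store and references the two blocks already present.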
The first block-grain technologies used fixed-block deduplication, in which the block size was preset to a constant value. While fixed-block deduplication was an improvement over file-grain deduplication, it does not tolerate the insertion of data at the beginning or in the middle of a file. Because block boundaries fall at fixed offsets, inserted data generally shifts the content of every subsequent block, so fixed-block deduplication detects all subsequent blocks as new blocks, resulting in a lower deduplication compression ratio. In such a scenario, fixed-block deduplication suffers from inefficiencies approaching those associated with file-grain deduplication technologies.
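The effect of an insertion on fixed-size blocks can be checked with a short Python sketch. The helper names `fixed_blocks` and `shared_block_count`, the eight-byte block size, and the 64-byte sample input are assumptions chosen so that no block repeats by coincidence.

```python
import hashlib

def fixed_blocks(data, size):
    """Split data into consecutive blocks of a constant size."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def shared_block_count(old, new, size):
    """Count blocks of `new` that already exist among the blocks of `old`."""
    known = {hashlib.sha256(b).digest() for b in fixed_blocks(old, size)}
    return sum(hashlib.sha256(b).digest() in known
               for b in fixed_blocks(new, size))

original = bytes(range(64))  # 64 distinct byte values, 8 blocks of 8 bytes
appended = original + b"!"
inserted_middle = original[:32] + b"!" + original[32:]
inserted_front = b"!" + original

for label, variant in [("append", appended),
                       ("middle", inserted_middle),
                       ("front", inserted_front)]:
    print(label, shared_block_count(original, variant, 8), "of 8 blocks reused")
```

Appending leaves all eight original blocks reusable, inserting one byte in the middle leaves only the four blocks before the insertion point, and inserting one byte at the front leaves none.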
Variable-block deduplication technologies addressed the problems of data insertion inherent in fixed-block technologies, but at the cost of additional processing. Variable-block deduplication sets block boundaries by identifying particular markers (so-called “magic numbers”) within the file's data. Because the boundaries follow the content rather than fixed offsets, an insertion changes only the blocks near the inserted data. While variable-block deduplication technologies improved compression ratios, performance was slowed by the increased overhead of identifying the block boundaries.
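One way such content-derived boundaries might be computed is sketched below in Python. This is a simplified stand-in, not the method of any specific product: the marker here is "the sum of the last `window` bytes is divisible by `divisor`", whereas practical systems typically use a stronger rolling hash (for example a Rabin fingerprint) and enforce minimum and maximum block sizes. The function name, parameters, and pseudo-random test data are all assumptions.

```python
import random

def content_defined_chunks(data, window=16, divisor=64):
    """Cut a block wherever the rolling sum of the last `window` bytes
    is divisible by `divisor` (a toy marker test)."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        end = i + 1
        if end >= window and rolling % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing block without a marker
    return chunks

rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4096))

old_chunks = content_defined_chunks(data)
new_chunks = content_defined_chunks(b"!" + data)  # one byte inserted at front

reused = sum(chunk in set(old_chunks) for chunk in new_chunks)
print(reused, "of", len(new_chunks), "blocks reused after the insertion")
```

Because each boundary decision depends only on the preceding `window` bytes, every original block after the first reappears unchanged once a byte is inserted at the front. The per-byte marker test in the loop is the additional processing that fixed-size blocks avoid.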