Deduplication for secondary storage systems has recently seen a lot of attention in both research and commercial applications. Deduplication offers significant reductions in storage capacity requirements by identifying identical blocks in the data and storing only a single copy of such blocks. Previous results have shown that significant duplication exists in backup data. This is not surprising, given that subsequent backups of the same systems are usually very similar.
Deduplicating storage systems vary along a number of dimensions. Some systems only deduplicate identical files, while others split the files into smaller blocks and deduplicate those blocks. The present invention will focus on block-level deduplication, because backup applications typically aggregate individual files from the filesystem being backed up into large tar-like archives. Since such an archive changes whenever any file inside it changes, deduplication at the level of whole files would not give much space reduction.
The blocks can be of fixed or variable size, with variable-sized blocks typically produced by content-defined chunking. Using content-defined variable-sized blocks was shown to improve the deduplication efficiency significantly.
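The difference between fixed-size and content-defined blocks can be illustrated with a minimal sketch. It is not the chunking method of any particular system: the gear-style rolling hash, the names cdc_chunks and fixed_chunks, and the parameters MIN_SIZE, MAX_SIZE and AVG_BITS are all assumptions chosen for illustration. A block boundary is placed wherever a hash of the most recent bytes matches a fixed pattern, so boundaries follow the content rather than the byte offset.

```python
import random
from typing import List

# Illustrative parameters (assumptions, not taken from the text).
MIN_SIZE = 256    # never cut a block shorter than this
MAX_SIZE = 4096   # always cut once a block reaches this length
AVG_BITS = 10     # past MIN_SIZE, expect a cut about every 2**10 bytes

# Fixed pseudo-random table mapping each byte value to a 64-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes) -> List[bytes]:
    """Split data into variable-sized blocks at content-defined boundaries."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash: older bytes are shifted out of the
        # 64-bit value, so h depends only on the last 64 bytes seen.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        # Boundary condition: the top AVG_BITS bits of the hash are zero.
        at_boundary = length >= MIN_SIZE and (h >> (64 - AVG_BITS)) == 0
        if at_boundary or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial block
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> List[bytes]:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Under these assumptions, inserting a single byte at the start of a stream shifts every fixed-size block, so none of them match the earlier version, whereas the content-defined boundaries realign after the first block or so and the remaining blocks are unchanged.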
Most systems eliminate identical blocks, while some only require the blocks to be similar and store the differences efficiently. While this can improve deduplication effectiveness, it requires reading the previous blocks from disk, making it difficult to deliver high write throughput. The present invention will therefore focus on identical block deduplication.
(Overview of Deduplicating Storage)
A backup storage system is typically presented with long data streams created by backup applications. These streams are typically archive files or virtual tape images. The data streams are divided into blocks, and a secure hash (e.g., SHA-1) is computed for each of the blocks. These hash values are then compared to hashes of blocks previously stored in the system. Since finding a hash collision for secure hash functions is extremely unlikely, blocks with the same hash value can be assumed to be identical (the so-called compare-by-hash approach). Therefore, if a block with the same hash is found, the block is considered a duplicate and it is not stored. The identifiers of all blocks comprising the data stream are stored and can be used to reconstruct the original data stream on read.
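The write and read paths described above can be sketched as follows. This is a minimal in-memory illustration, not the storage layout of any actual system: the class DedupStore, its methods, and the helper split_fixed are hypothetical names, and fixed-size blocks are assumed only to keep the example short. The SHA-1 digest of each block serves as its identifier.

```python
import hashlib
from typing import Dict, List


def split_fixed(data: bytes, size: int) -> List[bytes]:
    """Hypothetical helper: divide a stream into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy block store that keeps one copy of each distinct block."""

    def __init__(self) -> None:
        # Maps a block's SHA-1 digest to the block contents.
        self.blocks: Dict[bytes, bytes] = {}

    def write_stream(self, blocks: List[bytes]) -> List[bytes]:
        """Store a stream; return the ordered list of block identifiers."""
        identifiers = []
        for block in blocks:
            digest = hashlib.sha1(block).digest()
            # Compare by hash: an existing digest means the block is
            # treated as a duplicate and its data is not stored again.
            if digest not in self.blocks:
                self.blocks[digest] = block
            identifiers.append(digest)
        return identifiers

    def read_stream(self, identifiers: List[bytes]) -> bytes:
        """Reconstruct the original stream from its block identifiers."""
        return b"".join(self.blocks[d] for d in identifiers)
```

For example, writing the streams b"AAAABBBBCCCC" and then b"AAAABBBBDDDD" with 4-byte blocks stores four distinct blocks rather than six, and each stream can still be read back in full from its own list of identifiers.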