In computing, deduplication refers to a technique in which redundant data is deleted from the storage space to improve storage utilization. In the deduplication process, the goal is to retain only a single copy of the data that is to be stored, as opposed to storing multiple copies of the same data. Accordingly, blocks of data that are or are to be stored on a storage medium are compared to detect the duplicate copies. Each block of data is assigned an identification or a signature that is typically calculated using cryptographic hash functions.
In general, if the signatures of two or more data blocks are identical, then it is assumed that the data blocks are duplicates (i.e., bitwise identical). As such, the data block signatures for the file content are compared with signatures in a hash table, a data structure that maps the signature of the file content to data blocks in the storage media. If there is a match, then instead of storing duplicate copies of the same content, a reference is created to the matching content already stored in the file system. Thus, the deduplication process may be applied either to files that are being written to a storage medium or to files that have already been stored on a storage medium. The former is referred to as real-time, online, or inline deduplication. The latter is referred to as post-processing or offline deduplication.
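The signature lookup described above can be illustrated with a minimal sketch. It assumes fixed-size 4096-byte blocks, SHA-256 as the cryptographic hash function, and an in-memory dictionary standing in for the hash table; the class name `DedupStore` and its methods are hypothetical and chosen only for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems may chunk differently


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # signature -> block bytes (plays the role of the hash table)
        self.files = {}   # file name -> ordered list of signatures (the references)

    def write_file(self, name, data):
        refs = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            signature = hashlib.sha256(block).hexdigest()
            # Store the block only if no block with this signature exists yet.
            if signature not in self.blocks:
                self.blocks[signature] = block
            # Either way, the file records a reference to the stored block.
            refs.append(signature)
        self.files[name] = refs

    def read_file(self, name):
        # Reassemble the file by following its references in order.
        return b"".join(self.blocks[sig] for sig in self.files[name])


store = DedupStore()
payload = b"A" * (3 * BLOCK_SIZE) + b"B" * BLOCK_SIZE
store.write_file("example.bin", payload)
print(len(store.files["example.bin"]), "references,", len(store.blocks), "stored blocks")
```

In this sketch the file consists of four blocks but only two distinct blocks are stored, because the three identical blocks share one signature. Treating equal signatures as equal content relies on hash collisions being negligibly unlikely, which is the assumption stated in the text.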
The common goal of both deduplication methods mentioned above is to minimize the volume of data stored on the storage media, by preventing duplicate data blocks from being written in the online case and by removing them from the storage system in the offline case. The online deduplication approach additionally reduces the amount of data transferred and conserves communication bandwidth, especially when the storage media is remotely located with respect to the source, because in an online deduplication approach the duplicate data is not written to the storage device to begin with. As such, there is no need to transfer the data from a source to a destination storage device if it is determined that the data is a duplicate of data already stored on the destination storage device.
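The bandwidth saving of the online approach can be sketched as follows. This is an illustrative model only: `RemoteStore`, its `has` and `put` methods, and `send_blocks` are hypothetical names, SHA-256 is an assumed choice of hash function, and the "network" is simulated by counting the payload bytes that reach the destination.

```python
import hashlib


class RemoteStore:
    """Hypothetical destination storage device that indexes blocks by signature."""

    def __init__(self):
        self.blocks = {}
        self.bytes_received = 0  # counts block payload bytes actually transferred

    def has(self, signature):
        return signature in self.blocks

    def put(self, signature, block):
        self.blocks[signature] = block
        self.bytes_received += len(block)


def send_blocks(blocks, remote):
    """Send only blocks the destination lacks; return references for all blocks."""
    refs = []
    for block in blocks:
        signature = hashlib.sha256(block).hexdigest()
        # Ask the destination about the signature before sending the payload.
        if not remote.has(signature):
            remote.put(signature, block)
        refs.append(signature)
    return refs


remote = RemoteStore()
refs = send_blocks([b"x" * 100, b"y" * 100, b"x" * 100], remote)
print(len(refs), "blocks referenced,", remote.bytes_received, "payload bytes transferred")
```

Here three 100-byte blocks are referenced but only 200 payload bytes are transferred, since the third block duplicates the first. The signature itself must still be exchanged for every block, so the saving applies to block payloads rather than to all traffic.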