In computing, deduplication refers to a technique in which redundant data is removed from storage to improve storage utilization. The goal of the deduplication process is to retain only a single copy of the data to be stored, rather than multiple copies of the same data. Accordingly, blocks of data stored on a storage medium are compared to detect duplicate copies. Each block of data is assigned an identifier or signature that is typically calculated using a cryptographic hash function.
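As an illustration of how block signatures might be computed, the following is a minimal sketch only. It assumes fixed-size blocks of 4096 bytes and SHA-256 as the cryptographic hash function; neither choice is specified above, and the name `block_signatures` is hypothetical.

```python
import hashlib
from typing import List

# Assumption for illustration: fixed-size blocks of 4096 bytes.
BLOCK_SIZE = 4096


def block_signatures(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Split data into fixed-size blocks and return one signature per block.

    SHA-256 is an assumed choice of cryptographic hash function.
    """
    signatures = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signatures.append(hashlib.sha256(block).hexdigest())
    return signatures


if __name__ == "__main__":
    # Three blocks; the first and third hold identical content.
    sample = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
    sigs = block_signatures(sample)
    print(sigs[0] == sigs[2])  # identical blocks yield identical signatures
```

In this sketch, blocks with the same content always produce the same signature, which is what allows duplicates to be detected by comparing signatures rather than the blocks themselves.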
In general, if the signatures of two or more data blocks are identical, then it is assumed that the data blocks are duplicates (i.e., bitwise identical). As such, when a new file is to be stored, the data block signatures for the file content are first compared with the signatures in a hash table. The hash table is a data structure that maps the signatures of file content to data blocks on the storage media. If there is a match, the file content is not copied; instead, a reference is created to the matching content already stored in the file system. This approach requires maintaining an index system (i.e., the hash table and the related software).
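The lookup-and-reference flow described above can be sketched as follows. This is a simplified in-memory model under stated assumptions, not a description of any particular system: the class name `DedupStore`, its methods, the use of SHA-256, and the fixed block size are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy in-memory store that keeps one copy of each distinct block."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        # Hash table: block signature -> the single stored copy of that block.
        self.blocks: Dict[str, bytes] = {}
        # Per-file references: file name -> ordered list of block signatures.
        self.files: Dict[str, List[str]] = {}

    def put(self, name: str, data: bytes) -> None:
        """Store a file, copying only blocks whose signature is not yet indexed."""
        refs = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            sig = hashlib.sha256(block).hexdigest()
            if sig not in self.blocks:
                # No match in the hash table: store the new block.
                self.blocks[sig] = block
            # Match or not, the file records only a reference to the block.
            refs.append(sig)
        self.files[name] = refs

    def get(self, name: str) -> bytes:
        """Rebuild a file by following its references into the block table."""
        return b"".join(self.blocks[sig] for sig in self.files[name])


if __name__ == "__main__":
    store = DedupStore(block_size=4)
    store.put("first", b"AAAABBBB")
    store.put("second", b"AAAACCCC")  # shares the block b"AAAA" with "first"
    print(len(store.blocks))          # number of distinct blocks kept
    print(store.get("second"))
```

Note that this sketch, like the approach it models, treats equal signatures as proof of equal content; a production system might add a byte-wise comparison if hash collisions are a concern.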
The above-noted index system is separate from the file system that is used to manage the files on the storage media and is thus separately implemented and maintained. For example, in a system that supports full-object deduplication, an elaborate data structure such as an extensible hash table, B-tree, or other complex data structure is generally used to implement the hash directory. Each system component has separate cluster management and scale-out capabilities, resulting in redundancies and inefficiencies.