Some existing storage systems support asynchronous deduplication of storage. In some existing asynchronous deduplication methods, all data is initially written to storage. Subsequently, when idle processing resources are available, blocks of the data are read back and a hash of each block is calculated. Records are created to track the location of each block in storage and how many times a given block is referenced. Those records are searched to determine whether a block is a duplicate and, if so, the duplicate is deleted from storage and the records are adjusted accordingly. Some of the records are organized as one or more key-value tables maintained in a content-based chunk store. In one example, the records are indexed by a hash of a block of the data (e.g., the hash of the block of data is the key), and the value associated with that hash is the reference count for the block of data and its address in storage (e.g., HashOfData is a key into <ReferenceCount, AddressOfData>).
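By way of illustration only, the asynchronous method described above may be sketched as follows. The sketch is a minimal in-memory model, not an implementation of any particular existing system. The names Storage, Record, logical_map, and deduplicate, the 4096-byte block size, and the use of SHA-256 as the hash are assumptions introduced for the example; only the mapping of HashOfData to <ReferenceCount, AddressOfData> is taken from the description.

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size; the description does not fix one


@dataclass
class Record:
    """Value stored in the chunk store: <ReferenceCount, AddressOfData>."""
    reference_count: int
    address_of_data: int


class Storage:
    """Hypothetical in-memory stand-in for block storage."""

    def __init__(self):
        self.blocks = {}       # physical address -> block bytes
        self.logical_map = {}  # logical block number -> physical address
        self.next_address = 0
        self.reads = 0         # counts read I/O operations

    def write(self, lbn, data):
        # All data is initially written to storage, duplicates included.
        address = self.next_address
        self.next_address += 1
        self.blocks[address] = data
        self.logical_map[lbn] = address

    def read(self, address):
        self.reads += 1
        return self.blocks[address]


def deduplicate(storage):
    """Background pass, run when idle processing resources are available."""
    chunk_store = {}  # HashOfData -> Record
    for lbn, address in sorted(storage.logical_map.items()):
        data = storage.read(address)  # block is read back from storage
        key = hashlib.sha256(data).hexdigest()
        record = chunk_store.get(key)
        if record is None:
            # First time this content is seen: create its record.
            chunk_store[key] = Record(1, address)
            continue
        if record.address_of_data != address:
            # Duplicate content: delete it and point at the kept copy.
            del storage.blocks[address]
            storage.logical_map[lbn] = record.address_of_data
        record.reference_count += 1
    return chunk_store
```

In this sketch, writing three blocks of which two are identical and then calling deduplicate(storage) leaves two blocks in storage, with the record for the repeated block holding a reference count of two and the address of the first copy.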
Under some existing asynchronous deduplication systems, deduplication is performed on data that is already in storage. Therefore, an additional read of the data from storage is needed to calculate the hash of the data, after which the key-value table is searched and duplicates in storage are deleted. This consumes additional resources, as more input/output (I/O) operations are required.
In synchronous (e.g., in-line) deduplication, data is deduplicated before being written to storage. However, synchronous deduplication is not feasible in many instances because of the processor-intensive computations it requires.
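For comparison, synchronous deduplication may be sketched as follows, again as a minimal in-memory model under stated assumptions. The class InlineDedupStore, its attributes, and the use of SHA-256 are hypothetical and introduced only for the example, and each logical block is assumed to be written once. The sketch shows that the hash is calculated on the write path for every block, which is the source of the processing cost noted above.

```python
import hashlib


class InlineDedupStore:
    """Hypothetical in-memory model of synchronous (in-line) deduplication."""

    def __init__(self):
        self.chunk_store = {}  # HashOfData -> [ReferenceCount, AddressOfData]
        self.blocks = {}       # physical address -> block bytes
        self.logical_map = {}  # logical block number -> physical address
        self.next_address = 0
        self.hashes_on_write_path = 0

    def write(self, lbn, data):
        # The hash is computed before the write completes, so its cost is
        # paid on every write rather than deferred to idle time.
        key = hashlib.sha256(data).hexdigest()
        self.hashes_on_write_path += 1

        entry = self.chunk_store.get(key)
        if entry is None:
            # New content: this is the only case that writes to storage.
            address = self.next_address
            self.next_address += 1
            self.blocks[address] = data
            entry = self.chunk_store[key] = [0, address]

        # Assumption: each logical block is written once, so no earlier
        # reference needs to be released here.
        entry[0] += 1
        self.logical_map[lbn] = entry[1]
```

In this sketch a duplicate block never reaches storage, so no later read-back is needed, but every write incurs a hash calculation and a lookup in the key-value table.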
Corresponding reference characters indicate corresponding parts throughout the drawings.