1. Field of the Invention
The present invention relates to a computer program product, system, and method for deduplicating chunk digests received for chunks in objects provided by clients to store in a storage.
2. Description of the Related Art
Data deduplication is a data reduction technique for eliminating redundant data to improve storage utilization. Deduplication reduces the required storage capacity because only one copy of a unique data unit, also known as a chunk, is stored. Disk-based storage systems, such as a storage management server or Virtual Tape Library (VTL), may implement deduplication technology to detect redundant data chunks and reduce duplication by avoiding redundant storage of such chunks. Storage-based data deduplication reduces the amount of storage needed for a given set of files and is most effective in applications where many copies of very similar or even identical data are stored on a single disk, which is a common scenario. In the case of data backups, which are performed routinely to protect against data loss, most of the data in a given backup has not changed from the previous backup, which presents many opportunities for deduplication to eliminate redundant storage of data.
Data deduplication may operate at the file or block level. File deduplication eliminates duplicate files. Block deduplication looks within a file and saves only one copy of each unique block. A block deduplication system operates by dividing a file into a series of chunks. The deduplication system determines whether any of the chunks are already stored, and then stores only the non-redundant chunks. Redundancy may be checked against other chunks in the file being stored or against chunks already stored in the system.
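By way of illustration only, block-level deduplication of this kind might be sketched as follows. The fixed chunk size, the in-memory set standing in for the storage, and the names split_into_chunks and store_non_redundant are assumptions made for this example and are not taken from any particular deduplication system. The sketch compares raw chunk contents directly, so a chunk is treated as redundant whether it repeats within the same file or is already stored.

```python
# Minimal sketch (Python 3.9+). All names and the fixed chunk size are
# hypothetical; real systems may use variable-size chunking instead.

def split_into_chunks(data: bytes, chunk_size: int = 4) -> list[bytes]:
    """Divide data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_non_redundant(data: bytes, stored: set[bytes], chunk_size: int = 4) -> list[bytes]:
    """Add only chunks not already in `stored`; return the chunks newly written."""
    newly_written = []
    for chunk in split_into_chunks(data, chunk_size):
        # A chunk is redundant if it was stored earlier, either by a prior
        # file or by an earlier position in this same file.
        if chunk not in stored:
            stored.add(chunk)
            newly_written.append(chunk)
    return newly_written


# The storage already holds one chunk from an earlier file.
stored_chunks: set[bytes] = {b"AAAA"}
written = store_non_redundant(b"AAAABBBBAAAABBBBCC", stored_chunks)
print(written)  # [b'BBBB', b'CC']
```

In this example the file divides into five chunks, of which only two are written, because one chunk was already stored and two others repeat within the file.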
A chunk may be uniquely identified by a digest calculated from the chunk data. If an entry is found in an index for the digest of a chunk, then the chunk is identified as redundant, and that chunk in the data object can be replaced with a pointer to the matching chunk maintained in storage.
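As a hedged illustration, digest-based identification of redundant chunks might be sketched as follows. The choice of SHA-256, the dictionary used as the digest index, the list used as the storage, and the names chunk_digest and deduplicate are all assumptions made for this example. With a cryptographic hash, two different chunks could in principle produce the same digest, although such collisions are treated as negligible in practice.

```python
import hashlib

# Minimal sketch. The index maps a digest to a storage location; both
# structures and all function names are hypothetical.


def chunk_digest(chunk: bytes) -> str:
    """Calculate a digest from the chunk data (SHA-256 assumed here)."""
    return hashlib.sha256(chunk).hexdigest()


def deduplicate(chunks, index, storage):
    """Return the object as a list of pointers into `storage`."""
    pointers = []
    for chunk in chunks:
        digest = chunk_digest(chunk)
        location = index.get(digest)
        if location is None:
            # No entry for this digest: store the chunk and index it.
            location = len(storage)
            storage.append(chunk)
            index[digest] = location
        # A redundant chunk is represented only by a pointer to the
        # matching chunk already maintained in storage.
        pointers.append(location)
    return pointers


index = {}
storage = []
recipe = deduplicate([b"alpha", b"beta", b"alpha"], index, storage)
restored = b"".join(storage[p] for p in recipe)
print(recipe)    # [0, 1, 0]
print(restored)  # b'alphabetaalpha'
```

The third chunk matches the digest of the first, so it is not stored again, and the object can still be reconstructed by following the pointers.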
If a deduplication appliance or manager is receiving thousands of chunks of data to deduplicate, the deduplication appliance may have to stall the ingest streams to allow the chunk digests for the data chunks to be indexed, so that multiple copies of a chunk of data are not stored. Other techniques for managing the processing of numerous received chunks to deduplicate are to write or commit extents on a per-chunk basis, or to write data twice and clean up later.
There is a need in the art for improved techniques for performing deduplication operations.