1. Field of the Invention
The present invention relates to a computer program product, system, and method for managing dereferenced chunks in a deduplication system.
2. Description of the Related Art
Data deduplication is a data compression technique for eliminating redundant data to improve storage utilization. Deduplication reduces the required storage capacity because only one copy of a unique data unit, also known as a chunk or extent, is stored. Disk-based storage systems, such as a storage management server or Virtual Tape Library (VTL), may implement deduplication technology to detect redundant data chunks and reduce duplication by avoiding redundant storage of such chunks.
A deduplication system operates by dividing a file into a series of chunks, or extents. The deduplication system determines whether any of the chunks are already stored, and then stores only the non-redundant chunks. Redundancy may be checked against other chunks in the file being stored or against chunks already stored in the system.
An object may be divided into chunks using a fingerprinting technique such as Karp-Rabin fingerprinting. Redundant chunks are detected by applying a hash function, such as MD5 (Message-Digest Algorithm 5) or SHA-1 (Secure Hash Algorithm 1), to each chunk to produce a hash value, and then comparing that hash value against the hash values of chunks already stored on the system. Typically, the hash values for stored chunks are maintained in an index (the deduplication index). A chunk may be uniquely identified by a hash value, or digest, and a chunk size. The hash of a chunk being considered is looked up in the deduplication index. If an entry is found for that hash value and size, then a redundant chunk is identified, and that chunk in the object can be replaced with a pointer to the matching chunk maintained in storage.
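By way of illustration only, the index lookup described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: fixed-size chunking is used in place of Karp-Rabin fingerprinting for brevity, and the names dedup_store, restore, index, storage, and chunk_size are hypothetical and do not appear in any particular system.

```python
import hashlib


def dedup_store(data, index, storage, chunk_size=4):
    """Store `data`, keeping only chunks not already in the index.

    Assumption: fixed-size chunks stand in for fingerprint-based chunking.
    `index` maps (SHA-1 digest, chunk size) to a position in `storage`.
    Returns the list of pointers (storage positions) describing the object.
    """
    pointers = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # A chunk is identified by its digest together with its size.
        key = (hashlib.sha1(chunk).hexdigest(), len(chunk))
        if key not in index:
            # Non-redundant chunk: store it and record it in the index.
            storage.append(chunk)
            index[key] = len(storage) - 1
        # Redundant or not, the object refers to the stored copy by pointer.
        pointers.append(index[key])
    return pointers


def restore(pointers, storage):
    """Rebuild an object from its list of chunk pointers."""
    return b"".join(storage[p] for p in pointers)
```

In this sketch, storing an object whose chunks repeat, or whose chunks were already stored by an earlier object, adds pointers but no new chunk data.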
In a client-server software system, deduplication can be performed at the data source (client), at the target (server), or on a deduplication appliance connected to the server. The ability to deduplicate data at the source or at the target offers flexibility with respect to resource utilization and policy management. There are also tradeoffs between source-side and target-side deduplication. For example, source-side deduplication can conserve network bandwidth, but may also require deployment of special agent software to each source. Further, deduplication may be performed between multiple servers such that a source server sends data extents to the target server only if those extents are not already stored at the target server.
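By way of illustration only, the source-side exchange described above, in which a source sends only those extents not already stored at the target, may be sketched as follows. This is a hedged sketch under stated assumptions: the TargetServer class, its missing and receive methods, and the send_object and chunk_key functions are hypothetical names, the network is modeled as direct method calls, and fixed-size extents are assumed.

```python
import hashlib


def chunk_key(chunk):
    """Identify an extent by its SHA-1 digest and its size."""
    return (hashlib.sha1(chunk).hexdigest(), len(chunk))


class TargetServer:
    """Hypothetical target that stores extents keyed by (digest, size)."""

    def __init__(self):
        self.chunks = {}

    def missing(self, keys):
        # Report which of the offered keys are not yet stored at the target.
        return [k for k in keys if k not in self.chunks]

    def receive(self, key, chunk):
        self.chunks[key] = chunk


def send_object(data, target, chunk_size=4):
    """Send only extents the target lacks; return the payload bytes sent."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Duplicate extents within the object collapse to a single key.
    keyed = {chunk_key(c): c for c in chunks}
    bytes_sent = 0
    for key in target.missing(list(keyed)):
        target.receive(key, keyed[key])
        bytes_sent += len(keyed[key])
    return bytes_sent
```

In this sketch only the keys travel to the target for every extent, while extent data travels only when the target reports the key as missing, which is the source of the bandwidth saving noted above.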