1. Field of the Invention
The present invention relates to a computer program product, system, and method for identifying modified chunks in a data set for storage.
2. Description of the Related Art
Data deduplication is a data compression technique for eliminating redundant data to improve storage utilization. Deduplication reduces the required storage capacity because only one copy of a unique data unit, also known as a chunk, is stored. Disk-based storage systems, such as a storage management server and Virtual Tape Library (VTL), may implement deduplication technology to detect redundant data chunks, such as extents or blocks, and reduce duplication by avoiding redundant storage of such chunks.
A deduplication system operates by dividing a file into a series of chunks, or extents. The deduplication system determines whether any of the chunks are already stored, and then stores only the non-redundant chunks. Redundancy may be checked against other chunks in the file being stored or against chunks already stored in the system.
An object may be divided into chunks using a fingerprinting technique such as Karp-Rabin fingerprinting. Redundant chunks are detected by applying a hash function, such as MD5 (Message-Digest Algorithm 5) or SHA-1 (Secure Hash Algorithm 1), to each chunk to produce a hash value, and then comparing that hash value against the hash values of chunks already stored on the system. Typically, the hash values for stored chunks are maintained in an index (dedup index). A chunk may be uniquely identified by a hash value, or digest, and a chunk size. The hash of a chunk being considered is looked up in the dedup index. If an entry is found for that hash value and size, then a redundant chunk is identified, and that chunk in the data set or object can be replaced with a pointer to the matching chunk maintained in storage.
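The chunking and index lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular product: the polynomial rolling hash stands in for Karp-Rabin fingerprinting, the constants (WINDOW, BASE, MOD, MASK, MIN_CHUNK, MAX_CHUNK) are arbitrary illustrative values, and the names split_chunks, chunk_key, deduplicate, and restore are hypothetical.

```python
import hashlib

# All constants below are assumed values chosen only for illustration.
WINDOW = 16          # bytes covered by the rolling hash window
BASE = 257           # polynomial base of the rolling hash
MOD = (1 << 31) - 1  # modulus of the rolling hash
MASK = 0x3F          # a boundary is declared when these hash bits are all zero
MIN_CHUNK = 32       # smallest chunk emitted (except possibly the last one)
MAX_CHUNK = 512      # a boundary is forced at this size


def split_chunks(data):
    """Split data into variable-size chunks at content-defined boundaries."""
    chunks = []
    start = 0                         # offset where the current chunk begins
    h = 0                             # rolling hash of the last WINDOW bytes
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Drop the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = pos + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks


def chunk_key(chunk):
    """Identify a chunk by its SHA-1 digest and its size."""
    return hashlib.sha1(chunk).hexdigest(), len(chunk)


def deduplicate(data, dedup_index):
    """Store only unseen chunks; return the list of keys that describes data."""
    recipe = []
    for chunk in split_chunks(data):
        key = chunk_key(chunk)
        if key not in dedup_index:    # no entry for this hash and size
            dedup_index[key] = chunk  # store the non-redundant chunk
        recipe.append(key)            # a pointer replaces the chunk itself
    return recipe


def restore(recipe, dedup_index):
    """Rebuild the original data by following the stored pointers."""
    return b"".join(dedup_index[key] for key in recipe)
```

In this sketch the dedup index is an in-memory dictionary keyed by the (digest, size) pair, so deduplicating the same data a second time adds no new entries and returns only pointers.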
In a client-server software system, deduplication can be performed at the data source (client), at the target (server), or on a deduplication appliance connected to the server. The ability to deduplicate data at the source or at the target offers flexibility with respect to resource utilization and policy management. Typically, the source and target systems follow this data backup protocol:
1. Source identifies data extent D in file F.
2. Source generates a hash value h(D) for the data extent D.
3. Source queries the target as to whether the target already has a data extent with hash value h(D) and size l(D).
4. If the target responds “yes”, the source simply notifies the target that the extent with hash h(D) and size l(D) is a part of file F.
5. If the target responds “no”, the source sends the data extent D with its hash h(D) and size l(D) to the target. The target stores D in a storage pool and enters h(D) and l(D) into the dedup index.
6. If more extents are to be processed, return to step 1.
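The backup protocol above can be sketched as a small in-process simulation. This is an illustrative sketch only: the class Target, its methods, and the function back_up are hypothetical names, the network exchange between source and target is replaced by direct method calls, and SHA-1 is assumed as the hash function h.

```python
import hashlib


class Target:
    """Hypothetical target (server) holding a storage pool and a dedup index."""

    def __init__(self):
        self.dedup_index = {}    # (h(D), l(D)) -> stored extent D
        self.files = {}          # file name -> ordered list of (h(D), l(D))
        self.bytes_received = 0  # extent bytes actually sent by the source

    def has_extent(self, digest, size):
        """Step 3: answer whether an extent with this hash and size exists."""
        return (digest, size) in self.dedup_index

    def add_reference(self, name, digest, size):
        """Step 4: record that an existing extent is part of file `name`."""
        self.files.setdefault(name, []).append((digest, size))

    def store_extent(self, name, data, digest, size):
        """Step 5: store a new extent and enter it into the dedup index."""
        self.bytes_received += len(data)
        self.dedup_index[(digest, size)] = data
        self.add_reference(name, digest, size)


def back_up(target, name, extents):
    """Source side: run steps 1 through 6 for every extent of file `name`."""
    for data in extents:                         # steps 1 and 6
        digest = hashlib.sha1(data).hexdigest()  # step 2: h(D)
        size = len(data)                         # l(D)
        if target.has_extent(digest, size):      # step 3
            target.add_reference(name, digest, size)      # step 4
        else:
            target.store_extent(name, data, digest, size)  # step 5
```

For example, backing up a file whose extents are b"aaaa", b"bbbb", b"aaaa" sends only eight bytes of extent data in this sketch, because the third extent is already known to the target when it is queried.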
With fixed-size blocks, or with variable-size blocks whose chunk boundaries can be determined without examining the data (e.g., without fingerprinting), the changed physical blocks can be mapped directly to the deduplicated copies of the blocks in storage. However, there is a need in the art for improved techniques for determining changed chunks in systems having variable-size chunks whose boundaries are determined by examining the data (e.g., by fingerprinting), such as variable-size blocks and extents.
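For the fixed-size case, the direct mapping from a changed byte range to the affected blocks is simple arithmetic, as the following sketch suggests. The function name changed_blocks and the 4096-byte block size used in the example are assumptions made only for illustration.

```python
def changed_blocks(offset, length, block_size):
    """Return the indexes of the fixed-size blocks touched by a changed byte range.

    offset and length describe the modified bytes; block_size is the fixed
    chunk size. No data needs to be examined to find the affected blocks.
    """
    if length <= 0:
        return range(0)                           # nothing changed
    first = offset // block_size                  # block holding the first changed byte
    last = (offset + length - 1) // block_size    # block holding the last changed byte
    return range(first, last + 1)
```

With 4096-byte blocks, a 4000-byte change starting at offset 5000 touches blocks 1 and 2. No comparable arithmetic is available when chunk boundaries depend on the data itself, which is the case the preceding paragraph identifies as needing improved techniques.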