1. Field of the Invention
The present invention relates to a computer program product, system, and method for encrypting data objects to back up to a server.
2. Description of the Related Art
Data deduplication is a data compression technique for eliminating redundant data to improve storage utilization. Deduplication reduces the required storage capacity because only one copy of a unique data unit, also known as a chunk, is stored. Disk-based storage systems, such as a storage management server or Virtual Tape Library (VTL), may implement deduplication technology to detect redundant data chunks and reduce duplication by avoiding redundant storage of such chunks.
A deduplication system operates by dividing a file into a series of chunks. The deduplication system then determines whether any of the chunks are already stored and stores only the non-redundant chunks. Redundancy may be checked against other chunks in the file being stored or against chunks already stored in the system.
An object may be divided into chunks using a fingerprinting technique such as Rabin-Karp fingerprinting. Redundant chunks are detected by applying a hash function, such as MD5 (Message-Digest Algorithm 5) or SHA-1 (Secure Hash Algorithm 1), to each chunk to produce a hash value and then comparing that hash value against the hash values of chunks already stored on the system. Typically, the hash values for stored chunks are maintained in an index (the deduplication index). A chunk may be uniquely identified by a hash value, or digest, and a chunk size. The hash of a chunk being considered is looked up in the deduplication index. If an entry is found for that hash value and size, then a redundant chunk is identified, and that chunk in the data object can be replaced with a pointer to the matching chunk maintained in storage.
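By way of illustration only, the index lookup described above may be sketched as follows. The sketch assumes fixed-size chunking as a simple stand-in for fingerprint-based chunking, SHA-1 as the hash function, and an in-memory dictionary as the deduplication index; the names CHUNK_SIZE, split_into_chunks, chunk_key, and deduplicate are hypothetical and do not come from any particular product.

```python
import hashlib

# Hypothetical chunk size, kept tiny so the example is easy to follow.
CHUNK_SIZE = 4


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Fixed-size chunking (an assumption; a real system may use fingerprinting)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_key(chunk: bytes) -> tuple:
    """Identify a chunk by its SHA-1 digest and its size."""
    return (hashlib.sha1(chunk).hexdigest(), len(chunk))


def deduplicate(data: bytes, index: dict, storage: list) -> list:
    """Store only non-redundant chunks and return one pointer per chunk.

    `index` maps (digest, size) to a position in `storage`; each returned
    pointer is such a position.
    """
    pointers = []
    for chunk in split_into_chunks(data):
        key = chunk_key(chunk)
        if key not in index:
            # No entry for this hash value and size: store the new chunk.
            storage.append(chunk)
            index[key] = len(storage) - 1
        # Redundant or not, the object refers to the stored copy by pointer.
        pointers.append(index[key])
    return pointers


index, storage = {}, []
pointers = deduplicate(b"ABCDABCDEFGH", index, storage)
restored = b"".join(storage[p] for p in pointers)
```

In this sketch the second occurrence of the chunk "ABCD" is not stored again; the object is instead represented by pointers into storage, from which the original data can be reassembled.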
In a client-server software system, the deduplication can be performed at the data source (client), at the target (server), or on a deduplication appliance connected to the server. The ability to deduplicate data at the source or at the target offers flexibility with respect to resource utilization and policy management. Typically, the source and target systems have the following data backup protocol:
    1. Source identifies data chunk D in file F.
    2. Source generates a hash value h(D) for the data chunk D.
    3. Source queries the target as to whether the target already has a data chunk with hash value h(D) and size l(D).
    4. If the target responds “yes”, the source simply notifies the target that the chunk with hash h(D) and size l(D) is a part of file F.
    5. If the target responds “no”, the source sends the data chunk D with its hash h(D) and size l(D) to the target. Target stores D in a storage pool and enters h(D) and l(D) into the deduplication index.
    6. If more chunks are to be processed, go to Step 1.
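The protocol steps above may be sketched, under stated assumptions, as follows. The sketch models the target as an in-process object rather than a networked server and uses SHA-1 for h(D); the class Target, its methods, and the function back_up are hypothetical names introduced only for illustration. Recording the chunk as part of file F in step 5 is likewise an assumption, since the protocol as stated only mentions storing D and updating the index.

```python
import hashlib


class Target:
    """Hypothetical in-process stand-in for the backup target (server)."""

    def __init__(self):
        self.dedup_index = set()    # entries of (h(D), l(D))
        self.storage_pool = {}      # (h(D), l(D)) -> chunk bytes
        self.file_manifests = {}    # file name -> ordered list of (h(D), l(D))

    def has_chunk(self, h: str, size: int) -> bool:
        # Step 3: answer the source's query.
        return (h, size) in self.dedup_index

    def note_chunk(self, file_name: str, h: str, size: int) -> None:
        # Step 4: record that an already stored chunk is part of the file.
        self.file_manifests.setdefault(file_name, []).append((h, size))

    def store_chunk(self, file_name: str, data: bytes, h: str, size: int) -> None:
        # Step 5: store D in the storage pool and enter h(D), l(D) in the index.
        self.storage_pool[(h, size)] = data
        self.dedup_index.add((h, size))
        # Assumption: the new chunk is also recorded as part of the file.
        self.note_chunk(file_name, h, size)


def back_up(file_name: str, chunks: list, target: Target) -> int:
    """Run the protocol for each chunk; return how many chunks were sent."""
    sent = 0
    for d in chunks:                         # Steps 1 and 6: next chunk D
        h = hashlib.sha1(d).hexdigest()      # Step 2: compute h(D)
        size = len(d)                        # l(D)
        if target.has_chunk(h, size):        # Step 3: query the target
            target.note_chunk(file_name, h, size)      # Step 4
        else:
            target.store_chunk(file_name, d, h, size)  # Step 5
            sent += 1
    return sent


target = Target()
sent_first = back_up("F", [b"alpha", b"beta", b"alpha"], target)
sent_second = back_up("G", [b"beta", b"gamma"], target)
```

In this sketch only two of the three chunks of file F and one of the two chunks of file G are transmitted, since the remaining chunks are already present in the deduplication index when they are queried.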
There is a need in the art for improved techniques for protecting data involved in deduplication.