1. Field of the Invention
The present invention relates to a method, system, and article of manufacture for managing metadata for data blocks used in a deduplication system.
2. Description of the Related Art
Data deduplication (often called “intelligent compression” or “single-instance storage”) is a method of reducing storage space used to store data by eliminating redundant data in files sharing common data. In deduplication systems, only one unique instance of the data is actually retained on storage media, such as disk or tape, and additional instances of the data in different files or databases may be replaced with a pointer to the unique data copy. Thus, if only a few bytes of a new file being added are different from data in other files, then only the new bytes are stored for the new file and pointers are included in the added file that reference the common data in other files or databases.
In a deduplication system, the metadata for each data block included in presently stored files includes a hash value generated from the content of the data block. The data blocks subject to deduplication are usually at the subfile level. When adding a file comprised of data blocks, a hash may be applied to each data block to determine whether the hash of the data block in the file being added matches a hash value in the metadata. If there is a match, the data block in the file is replaced with a pointer or reference to the metadata having the matching hash value.
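The hash-matching step described above can be illustrated with a minimal sketch. It assumes fixed-size blocks, SHA-256 as the hash function, and an in-memory dictionary as the metadata; the names `DedupStore`, `add_file`, and `read_file` are hypothetical and chosen only for illustration, not taken from any particular deduplication product.

```python
import hashlib


class DedupStore:
    """Illustrative store that keeps one copy of each unique data block."""

    def __init__(self, block_size=4):
        # Assumption: fixed-size subfile blocks; real systems may chunk differently.
        self.block_size = block_size
        # Metadata: hash value -> the single stored instance of the block.
        self.blocks = {}

    def add_file(self, data):
        """Return the file as an ordered list of references (hash values)."""
        refs = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # No matching hash in the metadata: store the new block.
                self.blocks[digest] = block
            # Match or not, the file keeps only a reference to the block.
            refs.append(digest)
        return refs

    def read_file(self, refs):
        """Rebuild the file content by following its references in order."""
        return b"".join(self.blocks[digest] for digest in refs)


store = DedupStore()
first = store.add_file(b"AAAABBBBCCCC")
second = store.add_file(b"AAAABBBBDDDD")
```

In this sketch the two files share their first two blocks, so only four unique blocks are stored for six blocks of file content.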
In a deduplication system, metadata is maintained for each data block included in currently stored files, where the data block comprises a subfile element. When a file including the data block is removed and the metadata for that data block is not referenced by another file, the metadata is removed.
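One common way to decide when block metadata is no longer referenced is a reference count per block. The following is a minimal sketch under that assumption; the source does not state how references are tracked, and the names `BlockMetadataTable`, `add_file`, and `remove_file` are hypothetical.

```python
class BlockMetadataTable:
    """Illustrative table that drops block metadata once no file references it."""

    def __init__(self):
        # Block hash -> number of references held by stored files.
        self.ref_counts = {}
        # File name -> ordered list of block hashes in that file.
        self.files = {}

    def add_file(self, name, block_hashes):
        self.files[name] = list(block_hashes)
        for block_hash in block_hashes:
            self.ref_counts[block_hash] = self.ref_counts.get(block_hash, 0) + 1

    def remove_file(self, name):
        """Remove a file and return the hashes whose metadata was removed."""
        removed = []
        for block_hash in self.files.pop(name):
            self.ref_counts[block_hash] -= 1
            if self.ref_counts[block_hash] == 0:
                # No remaining file references this block: remove its metadata.
                del self.ref_counts[block_hash]
                removed.append(block_hash)
        return removed


table = BlockMetadataTable()
table.add_file("F0", ["hA", "hB"])
table.add_file("F1", ["hB", "hC"])
removed_first = table.remove_file("F0")
```

Here removing the first file drops only the metadata for the block it did not share with the second file.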
FIG. 1 illustrates a system known in the art for storing data blocks and file metadata. A metadata storage stores file metadata, e.g., for files F0 and F1, that lists a pointer, e.g., PA, PC, PE, PG, PH, PJ, PL, PN, for each data block included in a file, where the order of the data block pointers in the file metadata F0, F1 provides an ordered list of the data blocks in the file and the block sizes. The file metadata F0, F1 further includes the length of each data block, e.g., LB, LD, LF, LH, LI, LK, LM, LO. A file data block storage stores the actual data blocks, e.g., PA, PC, PE, PG, PH, PJ, PL, PN, that are referenced in the file metadata in the metadata storage.
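The file metadata layout described above, an ordered list of pointer and length pairs into a separate block storage, can be sketched as follows. This is an illustration only: it assumes that a pointer is a byte offset into a single contiguous block storage, and the names `BlockEntry`, `read_file`, and `file_size`, as well as the sample contents, are hypothetical rather than taken from FIG. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockEntry:
    pointer: int  # assumed to be a byte offset into the block storage
    length: int   # length of the data block in bytes


def read_file(block_storage, entries):
    """Concatenate the referenced blocks in the order listed in the file metadata."""
    return b"".join(
        block_storage[entry.pointer:entry.pointer + entry.length]
        for entry in entries
    )


def file_size(entries):
    """The file size follows from the stored block lengths alone."""
    return sum(entry.length for entry in entries)


# Hypothetical block storage and file metadata for two files.
block_storage = b"helloworld!"
file_f0 = [BlockEntry(0, 5), BlockEntry(5, 5)]
file_f1 = [BlockEntry(5, 5), BlockEntry(10, 1)]
```

Because each entry carries its own length, blocks of different sizes can be listed in the same file metadata, and both files in this sketch reference the same stored block at offset 5.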
There is a need in the art for improved techniques for managing metadata used in deduplication.