Data storage solutions can be enhanced by introducing a form of compression known as “deduplication”. Deduplication generally refers to the elimination of redundant subfiles from data objects; these subfiles are commonly referred to as blocks, chunks, or extents. The deduplication process is usually applied to a large collection of files in a shared data store, and its successful operation greatly reduces the redundant storage of common data.
In a typical configuration, a disk-based storage system such as a storage-management server or virtual tape library has the capability to perform deduplication by detecting redundant data chunks within its data objects and preventing the redundant storage of such chunks. For example, the deduplicating storage system could divide file A into chunks a-h, detect that chunks b and e are redundant, and store the redundant chunks only once. The redundancy could occur within file A or between file A and other files stored in the storage system. Deduplication can be performed as objects are ingested by the storage manager (in-band) or after ingestion (out-of-band).
Known techniques exist for deduplicating data objects. Typically, the object is divided into chunks using a method such as Rabin fingerprinting. Redundant chunks are detected using a hash function such as MD5 or SHA-1 to produce a hash value for each chunk, and this hash value is compared against values for chunks already stored on the system. The hash values for stored chunks are typically maintained in an index. If a redundant chunk is identified, that chunk can be replaced with a pointer to the matching chunk.
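By way of illustration only, the chunk, hash, and index steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular product: fixed-size chunking stands in for Rabin fingerprinting to keep the example short, and the names `CHUNK_SIZE`, `split_into_chunks`, and `DedupStore` are hypothetical.

```python
import hashlib

# Assumption: a tiny fixed chunk size, used here in place of
# content-defined chunking such as Rabin fingerprinting.
CHUNK_SIZE = 4


def split_into_chunks(data, size=CHUNK_SIZE):
    """Divide a data object into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical in-memory store that keeps each unique chunk once."""

    def __init__(self):
        self.index = {}    # hash value -> stored chunk bytes
        self.objects = {}  # object name -> ordered list of hash values (pointers)

    def ingest(self, name, data):
        pointers = []
        for chunk in split_into_chunks(data):
            digest = hashlib.sha1(chunk).hexdigest()
            # Store the chunk only if its hash is not already in the index;
            # otherwise the object simply points at the existing chunk.
            if digest not in self.index:
                self.index[digest] = chunk
            pointers.append(digest)
        self.objects[name] = pointers

    def restore(self, name):
        """Rebuild an object by following its pointers into the index."""
        return b"".join(self.index[d] for d in self.objects[name])


store = DedupStore()
store.ingest("A", b"aaaabbbbccccbbbb")  # the chunk b"bbbb" occurs twice within A
store.ingest("B", b"bbbbdddd")          # b"bbbb" is also shared with A
```

In this sketch, six chunks are ingested but only four are stored, and each object is reconstructed from its list of pointers. Note that the sketch trusts a hash match completely, which is the source of the false-match risk discussed below.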
Advantages of data deduplication include requiring less storage capacity for a given amount of data; providing the ability to store significantly more data on a given amount of disk; and improving the ability to meet a recovery time objective (RTO) when restoring from disk rather than tape.
Although deduplication offers these potential benefits, it also introduces new risks of data loss, for several reasons. The first risk is false matches. Two different chunks could hash to the same value (a collision), causing the system to deduplicate an object by referencing a chunk that does not actually match. Depending on the hash function used, the probability of such a collision may be extremely low, but it is never zero. Avoidance techniques include computing multiple hashes for the same chunk, comparing other information about the chunks, or performing a byte-by-byte comparison. However, these techniques may involve additional, time-consuming processing of every chunk or byte.
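By way of illustration only, the byte-by-byte comparison technique might be sketched as follows. This is a minimal sketch under stated assumptions: `weak_hash` is a deliberately poor, hypothetical hash function chosen so that a collision is easy to trigger, and `VerifyingIndex` is a hypothetical name, not part of any real library.

```python
def weak_hash(chunk):
    """Deliberately weak stand-in hash: byte sum modulo 256 (assumption)."""
    return sum(chunk) % 256


class VerifyingIndex:
    """Hypothetical chunk index that confirms every hash match byte by byte."""

    def __init__(self, hash_fn=weak_hash):
        self.hash_fn = hash_fn
        self.buckets = {}  # hash value -> list of distinct chunks sharing that hash

    def add(self, chunk):
        """Return a (hash value, slot) pointer for the chunk.

        A chunk is deduplicated only when a stored chunk with the same
        hash is also byte-identical; a colliding chunk gets its own slot.
        """
        key = self.hash_fn(chunk)
        bucket = self.buckets.setdefault(key, [])
        for slot, stored in enumerate(bucket):
            if stored == chunk:  # byte-by-byte comparison
                return key, slot
        bucket.append(chunk)
        return key, len(bucket) - 1

    def get(self, pointer):
        key, slot = pointer
        return self.buckets[key][slot]


idx = VerifyingIndex()
p1 = idx.add(b"ab")
p2 = idx.add(b"ba")  # same byte sum as b"ab", so the weak hash collides
p3 = idx.add(b"ab")  # true duplicate of the first chunk
```

Here the two colliding chunks are stored separately, while the true duplicate reuses the existing pointer. The cost is the one noted above: every hash match triggers a full comparison against the stored chunk.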
Additionally, deduplication increases the potential impact of media failure. If one chunk is referenced by multiple data objects, loss of that one chunk due to media error or failure could result in data loss for many objects. Similarly, the risk of logic errors is higher because deduplication adds significant complexity to a storage system, creating the potential for data loss due to a programming error.
A solution is needed to achieve the benefits of deduplication while also providing protection against data loss from mechanisms such as those described above.