Data storage solutions can be enhanced by introducing a form of compression known as “deduplication”. Deduplication generally refers to the elimination of redundant subfiles from data objects; these subfiles are generally referred to as blocks, chunks, or extents. The deduplication process is usually applied to a large collection of files in a shared data store, and its successful operation greatly reduces the redundant storage of common data.
In a typical configuration, a disk-based storage system such as a storage-management server or virtual tape library has the capability to perform deduplication by detecting redundant data chunks within its data objects and preventing the redundant storage of such chunks. For example, the deduplicating storage system could divide file A into chunks a-h, detect that chunks b and e are redundant, and store the redundant chunks only once. The redundancy could occur within file A or with other files stored in the storage system. Deduplication can be performed as objects are ingested by the storage manager (in-band) or after ingestion (out-of-band).
Typically, when performing deduplication, the object is divided into chunks using a method such as Rabin fingerprinting. Redundant chunks are detected using a hash function such as MD5 or SHA-1 to produce a hash value for each chunk, and this hash value is compared against values for chunks already stored on the system. The hash values for stored chunks are typically maintained in an index. If a redundant chunk is identified, that chunk can be replaced with a pointer to the matching chunk.
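The chunk, hash, and index steps described above can be illustrated with a minimal sketch. It is not the method of any particular product: it assumes fixed-size chunking as a simple stand-in for Rabin fingerprinting, an in-memory dictionary as the index, and hypothetical names (`DedupStore`, `split_fixed`, `CHUNK_SIZE`) chosen only for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical fixed chunk size in bytes, kept tiny for illustration


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (a stand-in for content-defined chunking)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy in-memory store: one copy per unique chunk, keyed by its SHA-1 hash value."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}         # hash value -> stored chunk
        self.objects: dict[str, list[str]] = {}   # object name -> ordered chunk hashes

    def ingest(self, name: str, data: bytes) -> int:
        """Store an object and return how many chunks were newly written."""
        new_chunks = 0
        pointers = []
        for chunk in split_fixed(data):
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.index:
                # Hash value not in the index: store the chunk itself.
                self.index[digest] = chunk
                new_chunks += 1
            # Redundant or not, the object records only a pointer (the hash).
            pointers.append(digest)
        self.objects[name] = pointers
        return new_chunks

    def restore(self, name: str) -> bytes:
        """Rebuild an object by following its chunk pointers in order."""
        return b"".join(self.index[d] for d in self.objects[name])


store = DedupStore()
written_a = store.ingest("A", b"aaaabbbbaaaacccc")  # third chunk repeats the first
written_b = store.ingest("B", b"bbbbdddd")          # first chunk already stored via A
```

In this sketch, redundancy is detected both within object A and between A and B, mirroring the two cases described earlier. The sketch also trusts the hash value alone to decide that two chunks are identical.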
Advantages of data deduplication include requiring reduced storage capacity for a given amount of data; providing the ability to store significantly more data on a given amount of disk; and improving the ability to meet the required recovery time objective when restoring from disk rather than tape.
With the advent of data deduplication technologies, new challenges have arisen for data protection and management applications. Although data deduplication can make it possible to store more data on disk to improve restore performance as compared to tape, deduplication can result in degraded performance as compared to accessing non-deduplicated data from disk. Deduplication of data may result in a sub-optimal data restore scenario because data now has to be read back as chunks and those chunks may be dispersed across many different volumes within a given deduplication system. This dispersal of data caused by data deduplication could impact restore performance for data, and compromise the ability of a data protection product to meet the required recovery objectives.
Data deduplication technologies also put data at risk. In particular, as deduplication algorithms and processes assign references to a given chunk of data because it is the same (based on a hash value), and thereby cause data to be deduplicated, the “single point of failure” risk increases for all files that reference that data chunk. For example, if deduplication causes 100 files to reference the same single instance of a chunk of data and that data chunk is lost (such as from media failure or corruption), then all 100 files are impacted by the loss. Without data deduplication, only those of the 100 files specifically impacted by the media failure would be lost, and if the loss affected something as small as one block on disk, it might have affected only a single file of those 100.
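The 100-file example can be made concrete with a small sketch that counts which files are impacted when one stored chunk is lost. The layouts and names below (`files_affected_by_chunk_loss`, `deduplicated`, `non_deduplicated`, and the chunk identifiers) are hypothetical and exist only to illustrate the reasoning.

```python
def files_affected_by_chunk_loss(
    file_chunks: dict[str, list[str]], lost_chunk: str
) -> set[str]:
    """Return the names of all files that reference the lost stored chunk."""
    return {name for name, chunks in file_chunks.items() if lost_chunk in chunks}


# Deduplicated layout (assumed): all 100 files point at one shared instance "c0",
# and each file also has one chunk of its own.
deduplicated = {f"file{i}": ["c0", f"unique{i}"] for i in range(100)}

# Non-deduplicated layout (assumed): each file holds its own private copy
# of the same common data.
non_deduplicated = {f"file{i}": [f"c0-copy{i}", f"unique{i}"] for i in range(100)}

shared_loss = files_affected_by_chunk_loss(deduplicated, "c0")
private_loss = files_affected_by_chunk_loss(non_deduplicated, "c0-copy7")
```

Losing the single shared instance touches every file that references it, whereas losing one private copy touches only the file that owns it.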
Currently there are no known solutions to these issues, as the state of the art does not allow for tuning or policies that preserve the space-reduction benefits of data deduplication while addressing these concerns. What is needed is tuning and policies combined with the ability to optimize restore performance to meet recovery time objectives, and to mitigate the “single point of failure” risk inherent in storage products that use deduplication technologies. Additionally, what is needed is the ability to realize the benefits of data deduplication for general data while also optimizing restore performance and minimizing risk for the most critical data.