1. Field of the Invention
The present invention relates generally to data storage, and in particular to managing storage of data objects through adaptive deduplication.
2. Background Information
Digitization of large volumes of data and an increase in the richness of data content have led to high demand for online data storage. One way to counter this issue is to add additional hardware resources. However, in the storage domain, addition of more storage often results in a disproportionate increase in the Total Cost of Ownership (TCO). Though the cost of acquisition has fallen as a result of reduction in hardware costs, the cost of management, which mainly consists of administration costs and power/energy costs, has increased. Many companies are attempting to provide a better solution by using data footprint reduction technologies such as deduplication.
Data deduplication removes or minimizes the amount of redundant data in the storage system by keeping only unique instances of the data on storage. Redundant data is replaced with a pointer to the unique data copy. By reducing space requirements, deduplication reduces the need for new storage resources. A good implementation of deduplication can often lead to substantial savings, both in terms of new resource acquisition costs and management costs, which include administration and energy costs, thereby leading to a significant reduction in TCO. In backup environments, deduplication can also lower the network bandwidth requirement for remote backups, replication and disaster recovery by reducing the amount of data transmitted on the network.
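By way of illustration only, the replacement of redundant data with a pointer to a unique copy can be sketched as follows. The class name WholeFileDedupStore, its methods, and the choice of SHA-256 as the fingerprint are assumptions made for this example and are not taken from any particular deduplication product.

```python
import hashlib


class WholeFileDedupStore:
    """Toy in-memory store keeping one copy per distinct file content."""

    def __init__(self):
        self.blobs = {}     # digest -> bytes (the unique data copies)
        self.pointers = {}  # file name -> digest (pointer to the unique copy)

    def put(self, name, data):
        # Fingerprint the whole file; SHA-256 is an assumed choice.
        digest = hashlib.sha256(data).hexdigest()
        # Store the content only if no identical copy is already present.
        self.blobs.setdefault(digest, data)
        # The file itself is recorded as a pointer to the unique copy.
        self.pointers[name] = digest

    def get(self, name):
        # Follow the pointer to reconstruct the file.
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        # Space used by unique data only (bookkeeping is not counted).
        return sum(len(b) for b in self.blobs.values())
```

In this sketch, storing the same 100-byte content under two names consumes 100 bytes of data space rather than 200, at the price of one fingerprint computation per write and one pointer per file.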
Deduplication techniques can be classified into three major categories, namely: Whole File Hashing (WFH), Sub-File Hashing and Delta Encoding (DELTA). Sub-File Hashing can be further classified into Fixed Chunk Size Hashing (FSH) and Variable Chunk Size Hashing (VSH). Another less explored alternative is Content Aware Sub-File Hashing which can be further classified into Content Aware Fixed Size Hashing (CAFSH) and Content Aware Variable Size Hashing (CAVSH).
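The difference between fixed and variable chunk size hashing can be illustrated with a minimal sketch. The function names, the chunk size limits, and the simple gear-style rolling hash used to pick boundaries are all assumptions chosen for brevity; practical systems use other boundary functions and much larger chunks.

```python
import hashlib


def fixed_chunks(data, size=64):
    """FSH-style split: boundaries fall at fixed byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# Assumed table of 256 pseudo-random 32-bit values, one per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def variable_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """VSH-style split: a boundary is declared where a rolling hash of the
    most recent bytes matches a bit pattern, so boundaries follow content."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the low bits tested by the mask.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

With fixed boundaries, inserting a single byte at the front of a file shifts every later chunk, so previously stored chunks may no longer match. Content-defined boundaries tend to realign after such an insertion, which is one reason VSH is generally expected to find more redundancy than FSH, at a higher computational cost.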
Different deduplication alternatives and parameters have different characteristics in terms of Metadata Overhead (MO, bookkeeping space for deduplication, including the hash table, the chunk list for each file, etc.), Folding Factor (FF, data reduction ratio without considering MO), Deduplication Time (DT, time for deduplicating data) and Reconstruction Time (RT, time for restoring the deduplicated data), for different types of data.
For example, in general FF(WFH)<FF(FSH)<FF(VSH). However, in a multi-user backup dataset, WFH works quite well for source code and system files, which implies that FSH and VSH are not necessary for such files, especially considering their high overhead and resource consumption. For some data files such as composite files (tar, iso, etc.), content aware approaches (CAFSH and CAVSH) can improve the FF compared with the corresponding content agnostic approaches (FSH and VSH). For MO, the following relationships hold: MO(WFH)<MO(FSH) and MO(WFH)<MO(VSH). The smaller the fixed/average chunk size, the larger the MO of FSH/VSH. Content aware approaches have slightly larger MO than the corresponding content agnostic approaches because they produce slightly more chunks.
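The FF and MO relationships above can be made concrete with a small calculation. The sketch below assumes that FF is the original size divided by the size of the unique chunks, and that MO is 32 bytes (one SHA-256 digest) per hash table entry and per chunk list entry. Both conventions, and the name dedup_metrics, are assumptions for illustration; real bookkeeping formats differ.

```python
import hashlib

DIGEST_BYTES = 32  # assumed bookkeeping cost per entry (SHA-256 digest size)


def dedup_metrics(files, chunk_size):
    """files: dict of name -> bytes, split into fixed-size chunks.

    Returns (folding_factor, metadata_overhead_bytes)."""
    unique = {}     # digest -> chunk length (the hash table)
    chunk_refs = 0  # total entries across all per-file chunk lists
    original = 0    # total bytes before deduplication
    for data in files.values():
        original += len(data)
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            unique[hashlib.sha256(chunk).digest()] = len(chunk)
            chunk_refs += 1
    stored = sum(unique.values())
    ff = original / stored if stored else 1.0
    mo = DIGEST_BYTES * (len(unique) + chunk_refs)
    return ff, mo
```

For two 2048-byte files that share only their first 1024 bytes, a chunk size of 2048 behaves like WFH and gives an FF of 1.0 with an MO of 128 bytes under these assumptions, while a chunk size of 1024 gives an FF of about 1.33 with an MO of 224 bytes. The redundancy is found only once the chunks are small enough to isolate it, and the bookkeeping grows as the chunks shrink.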
In terms of deduplication time, in general: DT(WFH)<DT(FSH)<DT(VSH). Content aware approaches have slightly larger DT than corresponding content agnostic approaches because of the format analysis.
For reconstruction time, in general: RT(WFH)<RT(FSH) and RT(WFH)<RT(VSH), where RT(FSH)/RT(VSH) increases as the fixed/average chunk size decreases. Similarly, content aware approaches have slightly larger RT than the corresponding content agnostic approaches because they produce slightly more chunks.
Moreover, FSH, VSH, CAFSH and CAVSH also have different tunable parameters, mainly used to control the (average) chunk size, which affect the FF, MO, DT and RT. In general, the smaller the fixed/average chunk size, the higher the FF, the larger the MO, and the longer the DT and RT. However, for some file types, such as mp3 files, smaller chunk sizes do not improve the FF.
Besides the selection among different deduplication approaches, another important design choice for a deduplication system in a distributed or network-attached environment is when and where to perform the deduplication, that is, the architecture (timing and placement) of the deduplication system. There are three common architectures: 1) on-line deduplication at the deduplication client side; 2) on-line deduplication at the deduplication appliance or deduplication-embedded storage system side; and 3) off-line deduplication at the deduplication appliance or deduplication-embedded storage system side. On-line means the data is deduplicated before it is transferred (client side) or before it is written to storage (appliance or storage side). Off-line means the data is first stored and then deduplicated in the background. Different architectures have different tradeoffs in ingestion rate, network bandwidth requirement, storage requirement, CPU and I/O utilization, security impact, etc.
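The storage-requirement tradeoff between on-line and off-line deduplication can be sketched with two toy stores. The class names OnlineStore and OfflineStore, the whole-object SHA-256 fingerprint, and the logical byte accounting are assumptions for illustration; network transfer, CPU and I/O costs are not modeled.

```python
import hashlib


def _digest(data):
    return hashlib.sha256(data).hexdigest()


class OnlineStore:
    """Deduplicates each object before it is written."""

    def __init__(self):
        self.unique = {}  # digest -> data
        self.index = {}   # name -> digest
        self.peak_bytes = 0

    def write(self, name, data):
        d = _digest(data)              # work done on the write path
        self.unique.setdefault(d, data)
        self.index[name] = d
        self.peak_bytes = max(self.peak_bytes, self.used_bytes())

    def used_bytes(self):
        return sum(len(v) for v in self.unique.values())


class OfflineStore:
    """Writes raw objects first; a background pass removes duplicates."""

    def __init__(self):
        self.raw = {}     # name -> data not yet deduplicated
        self.unique = {}  # digest -> data
        self.index = {}   # name -> digest
        self.peak_bytes = 0

    def write(self, name, data):
        self.raw[name] = data          # no hashing on the write path
        self.peak_bytes = max(self.peak_bytes, self.used_bytes())

    def background_dedup(self):
        for name, data in list(self.raw.items()):
            d = _digest(data)
            self.unique.setdefault(d, data)
            self.index[name] = d
            del self.raw[name]

    def used_bytes(self):
        return (sum(len(v) for v in self.raw.values())
                + sum(len(v) for v in self.unique.values()))
```

Writing three identical 100-byte objects gives a peak of 100 bytes for the on-line store and 300 bytes for the off-line store, although both settle at 100 bytes once the background pass has run. The off-line store, in exchange, performs no fingerprinting while ingesting data.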