1. Field of the Invention
The present invention relates generally to storage systems, and in particular to storage systems that use data deduplication to reduce storage utilization.
2. Description of the Related Art
A common goal of most storage systems is to reduce the storage of duplicate data. One technique used to attain this goal is referred to as “deduplication”. Deduplication is a process whereby redundant copies of the same file or file segments within a storage system are deleted. In this manner, a single instance of a given file segment (or other portion of data) is maintained. Such an approach is often referred to as single instance storage.
The advantage of deduplication is simple: it reduces storage consumption by storing only unique data. In a typical storage system, it is common to find duplicate occurrences of individual blocks of data in separate locations. Duplication of data blocks may occur when, for example, two or more files share common data or where a given set of data occurs at multiple places within an individual file. With the use of deduplication, however, only a single copy of a file segment is written to the storage medium, thereby reducing overall storage consumption.
The process of data deduplication is often utilized in backup storage applications. Backup applications generally benefit the most from deduplication due to the requirement for recurrent backups of an existing file system. Typically, most of the files within the file system will not change between consecutive backups, and therefore do not need to be stored again.
When a backup data stream is received by a storage system, the data stream is generally partitioned into segments. After partitioning, a fingerprint or other unique identifier is generated from each data segment. The fingerprint of each new data segment is compared to an index of fingerprints created from previously stored data segments. If a match between fingerprints is found, then the newly received data segment may be identical to one already stored in the storage system (i.e., represents redundant data). Therefore, rather than storing the new data segment, this data segment is discarded and a reference pointer is inserted in its place which identifies the location of the identical data segment in the backup data storage system. On the other hand, if the fingerprint does not have a match in the index, then the new data segment is not already stored in the storage system. Therefore, the new fingerprint is added to the index, and the new data segment is stored in the backup storage system.
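The flow described above may be illustrated with the following simplified sketch. It is not an implementation of any particular storage system; the segment size, the use of SHA-256 as the fingerprint function, and the in-memory index and storage structures are assumptions chosen purely for illustration.

```python
import hashlib

SEGMENT_SIZE = 4096  # hypothetical fixed segment size in bytes


def deduplicate(stream, index, storage):
    """Partition a backup stream into fixed-size segments, fingerprint each
    segment, and store only segments whose fingerprint is not yet indexed.

    Returns the sequence of references that reconstructs the stream."""
    references = []
    for offset in range(0, len(stream), SEGMENT_SIZE):
        segment = stream[offset:offset + SEGMENT_SIZE]
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint not in index:
            # No match in the index: this is new data, so store the
            # segment and add its fingerprint to the index.
            index[fingerprint] = len(storage)
            storage.append(segment)
        # Whether new or redundant, the stream is represented by a
        # reference to the single stored instance of the segment.
        references.append(index[fingerprint])
    return references
```

Running two identical backup streams through this sketch stores each unique segment once; the second stream contributes only references, not new data.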
In a typical deduplication-based storage system, the input data stream is partitioned into fixed-size segments following the exact sequence of the contiguous data stream. One drawback of this approach is that it fails to eliminate many redundant segments if the alignment between consecutive backup data streams is slightly different. For example, as noted above, when a single machine performs a backup of a given storage system snapshot, most of the data being sent to the backup storage medium will be unchanged from the previous snapshot. However, any individual file modification or deletion within the snapshot image may shift segment boundaries and result in the creation of a totally different set of segments. Consequently, many segments for a given file will not be identical to previous segments for the file—even though most of the data for the file remains unchanged.
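The boundary-shift problem can be demonstrated directly. In the following sketch (with a deliberately tiny segment size chosen only for illustration), inserting a single byte at the front of an otherwise unchanged stream shifts every subsequent fixed-size segment boundary, so the two streams share no fingerprints at all despite being nearly identical.

```python
import hashlib

SEGMENT_SIZE = 4  # deliberately tiny, purely for illustration


def fingerprints(data):
    """Fingerprint the fixed-size segments of a byte stream."""
    return [hashlib.sha256(data[i:i + SEGMENT_SIZE]).hexdigest()
            for i in range(0, len(data), SEGMENT_SIZE)]


baseline = b"abcdefghijklmnop"
modified = b"X" + baseline  # a single byte inserted at the front

# The insertion shifts every segment boundary: baseline partitions as
# abcd|efgh|ijkl|mnop, while modified partitions as Xabc|defg|hijk|...
# so none of the fingerprints match even though nearly all of the
# underlying data is unchanged.
shared = set(fingerprints(baseline)) & set(fingerprints(modified))
```

Here `shared` is empty: fixed-size partitioning finds no redundancy between the two backups, which is the deficiency motivating alignment-tolerant segmentation schemes.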
Another drawback of current approaches to deduplication is the strain placed on system resources by managing a large number of stored segments and by managing the deduplication process itself. If maximizing the deduplication ratio were the only goal, then partitioning the backup data stream into smaller segments would achieve it. However, with a smaller segment size, the number of segments and the size of the fingerprint index may grow too large to be easily managed. For example, as the fingerprint index grows, it may eventually exceed the size of the available physical memory. When this happens, portions of the index must be stored on disk, slowing reads and writes to the index and degrading the overall performance of the deduplication process. Additionally, when portions of the index are stored on disk, the task of searching for a fingerprint match will often become the bottleneck delaying the deduplication process. Ideally, the entire fingerprint index is stored in physical memory; to accomplish this, additional techniques are needed to keep the size of the index relatively small while still achieving a high deduplication ratio. Also, generating fingerprints consumes valuable processing resources, so reducing the number of fingerprints generated may further decrease the burden on the processing and memory resources of the deduplication storage system.
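The segment-size trade-off can be made concrete with simple arithmetic. The sketch below estimates the index footprint under an assumed fixed per-entry cost (fingerprint plus segment location); the 64-byte entry size and the 100 TiB store are hypothetical figures, not properties of any particular system.

```python
def index_size_bytes(logical_data_bytes, segment_size, entry_size=64):
    """Estimate the memory needed by the fingerprint index, assuming one
    fixed-size entry (fingerprint plus location) per stored segment."""
    num_segments = logical_data_bytes // segment_size
    return num_segments * entry_size


TiB = 1 << 40

# Halving the segment size doubles the number of segments, and with it
# the index. For a hypothetical 100 TiB store, 4 KiB segments require a
# 32x larger index than 128 KiB segments, and the former can easily
# exceed available physical memory.
small_segments = index_size_bytes(100 * TiB, 4 * 1024)
large_segments = index_size_bytes(100 * TiB, 128 * 1024)
```

The ratio between the two estimates equals the ratio of the segment sizes, which is why shrinking segments to raise the deduplication ratio directly inflates the index that must be kept in memory.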
In view of the above, improved methods and mechanisms for managing deduplication of data are desired.