As computers and computer data become increasingly prevalent, the amount of data being stored tends to increase. Advances in storage technology have improved storage system capabilities. Nonetheless, given that storing more data typically requires more storage capacity, and given that storage capacity comes with a price, there is significant interest in reducing the amount of storage space used to store data.
One technique used to reduce the amount of storage space used to store a given amount of data is known as deduplication. Deduplication involves identifying duplicate data and storing a single copy of the duplicate data, rather than storing multiple copies. For example, if two identical copies of a portion of data (e.g., a file) are stored on a storage device, deduplication involves removing one of the copies and instead storing a reference to the remaining copy. If access to the removed copy is requested, the request is redirected and the reference is used to access the remaining copy. Since the reference is typically small relative to the portion of data itself, the added space used to store the reference is more than offset by the space saved by removing the duplicate copy.
In order to expedite the process of determining whether identical data is already stored, deduplication engines typically divide the data into portions, or segments, and calculate a signature, or fingerprint, for each segment. When a segment is stored, the fingerprint that represents the segment can be added to a list of fingerprints representing stored segments. Then, by comparing a segment's fingerprint with the fingerprints included in the list, the deduplication engine can determine whether the segment is already stored. If so, rather than storing another copy of the segment, the deduplication engine stores a reference to the existing copy and updates a reference counter.
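By way of illustration only, the fingerprint lookup described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular deduplication engine: the class name DedupStore, the fixed segment size, and the choice of SHA-256 as the fingerprint function are all hypothetical and are chosen for brevity.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory deduplicating store (illustrative only)."""

    def __init__(self, segment_size=4):
        # Assumption: fixed-size segments. Real engines may segment differently.
        self.segment_size = segment_size
        self.segments = {}    # fingerprint -> the single stored copy of a segment
        self.refcounts = {}   # fingerprint -> number of references to that copy

    def fingerprint(self, segment):
        # Assumption: SHA-256 stands in for whatever signature an engine uses.
        return hashlib.sha256(segment).hexdigest()

    def write(self, data):
        """Store data and return the list of references that represents it."""
        refs = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            fp = self.fingerprint(segment)
            if fp not in self.segments:
                # Fingerprint not in the list: store the segment itself.
                self.segments[fp] = segment
            # In either case, record a reference and update the counter.
            self.refcounts[fp] = self.refcounts.get(fp, 0) + 1
            refs.append(fp)
        return refs

    def read(self, refs):
        """Reassemble the original data by following each reference."""
        return b"".join(self.segments[fp] for fp in refs)
```

In this sketch, writing twelve bytes that contain the same four-byte segment twice yields three references but only two stored segments, and the reference counter of the repeated segment is two.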
Occasionally, it is desired to migrate data from one storage location (e.g., a source) to another storage location (e.g., a destination). However, this can be complicated if either or both of the source and the destination are deduplicated. This is particularly true if the source and the destination do not use identical deduplication methodologies, or schemas. For example, if the destination is unable to properly interpret the fingerprints and/or references of the source, the data must be rehydrated (i.e., restored to its full, non-deduplicated form) and migrated in that form. Once the data has been transmitted from the source to the destination, the data is then deduplicated by the destination according to the deduplication methodology employed by the destination.
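The conventional rehydrate-and-re-fingerprint migration described above can be sketched as follows, again for illustration only. The function names, the fixed segment sizes, and the use of SHA-1 and SHA-256 to stand in for two dissimilar deduplication schemas are assumptions made for this sketch and are not drawn from any particular system.

```python
import hashlib


def deduplicate(data, segment_size, hash_name):
    """Split data into fixed-size segments and keep one copy per fingerprint.

    Returns (segments, refs): a mapping of fingerprint -> segment bytes and
    the ordered list of fingerprints needed to rebuild the data.
    """
    segments = {}
    refs = []
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        fp = hashlib.new(hash_name, segment).hexdigest()
        segments.setdefault(fp, segment)  # store only the first copy
        refs.append(fp)
    return segments, refs


def rehydrate(segments, refs):
    """Expand deduplicated data back into its full, non-deduplicated form."""
    return b"".join(segments[fp] for fp in refs)


def migrate(source_segments, source_refs, dest_segment_size, dest_hash_name):
    """Conventional migration between dissimilar schemas (hypothetical).

    The destination cannot interpret the source fingerprints, so the data is
    rehydrated, sent in full, and fingerprinted again at the destination.
    """
    full_data = rehydrate(source_segments, source_refs)
    bytes_transmitted = len(full_data)  # the full size crosses the network
    dest_segments, dest_refs = deduplicate(
        full_data, dest_segment_size, dest_hash_name
    )
    return dest_segments, dest_refs, bytes_transmitted
```

For example, under these assumptions, 32 bytes of data consisting of one 8-byte pattern repeated four times occupy a single 8-byte segment at a source that uses 8-byte segments, yet all 32 bytes are transmitted and every segment is fingerprinted again at the destination. The source fingerprints are discarded in the process.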
This presents several undesirable outcomes. For example, rehydrating and re-fingerprinting data uses significant resources, e.g., computer processing cycles. Furthermore, the rehydrated data is likely to be significantly larger, in terms of bytes, than the deduplicated data. The source and/or destination may be incapable of storing the larger quantity of data. Also, transmitting what may be terabytes (or more) of data consumes network bandwidth, which is typically limited. Finally, such migration operations are often scheduled for finite windows of time, and the duration of a migration operation may exceed the allowed window.
Another problem that results from rehydrating and re-fingerprinting data in order to migrate the data is that the old fingerprints (i.e., those used by the source) are not available at the destination, so data subsequently written to the destination cannot be deduplicated against those fingerprints. This perpetuates the need to rehydrate and re-fingerprint any data that is later migrated to the destination.
What is needed is a way to mitigate or avoid the significant resource consumption involved in rehydrating and re-fingerprinting data, as well as in transmitting the increased quantities of data, when migrating data between systems that use dissimilar deduplication methodologies. Such an approach would not only avoid the problems discussed above, but would also allow deduplication against the original source fingerprints, further improving the efficiency of migration operations and allowing the migration operations to complete within specified time windows.
While the invention is susceptible to various modifications and alternative forms, specific embodiments of the invention are provided as examples in the drawings and detailed description. It should be understood that the drawings and detailed description are not intended to limit the invention to the particular form disclosed. Instead, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the invention as defined by the appended claims.