1. Field
Embodiments of the invention relate to improving data deduplication by separating data from meta data.
2. Description of the Related Art
Storage management products store client data onto disk and/or tapes for backup purposes. This data can be stored without meta data to describe the data, but, to help guarantee data integrity, storage management software may also store its own meta data co-mingled with the file data. This added meta data helps detect tape processing errors (not detected by the drive itself) and allows further integrity by, for example, calculating Cyclic Redundancy Check (CRC) values on subsets of data, and storing the CRC values in the meta data. A cyclic redundancy check (CRC) may be described as a function that takes as input a data stream of any length and produces as output a value of a certain fixed size.
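As a rough illustration of calculating CRC values on subsets of data, the following Python sketch uses the standard-library zlib.crc32 function. The 4096-byte block size and the helper names crc_per_block and find_corrupt_blocks are assumptions made for this example; they are not taken from any particular storage management product.

```python
import zlib

BLOCK_SIZE = 4096  # assumed subset size, chosen only for illustration


def crc_per_block(data, block_size=BLOCK_SIZE):
    """Return one CRC-32 value for each block_size-byte subset of data."""
    return [
        zlib.crc32(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data, stored_crcs, block_size=BLOCK_SIZE):
    """Return the indexes of blocks whose CRC no longer matches the stored value."""
    fresh = crc_per_block(data, block_size)
    return [i for i, (new, old) in enumerate(zip(fresh, stored_crcs)) if new != old]
```

In such a scheme, the list returned by crc_per_block would be kept in the meta data when the data is written, and find_corrupt_blocks would be run when the data is read back.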
In order to help guarantee integrity, the meta data co-mingled with the client data may also contain other header information used to identify the source of the data (e.g., which client the data came from, name of a file, etc.). When stored on disk or tape media, this meta data helps guarantee that the original data is returned to the client when the original data is recovered from the disk or tape media.
In most cases, this co-mingling of data with meta data is very useful to help guarantee the identity and correctness of the data stored within a storage management system. When a file is backed up to a storage management system, the file is generally embedded in other larger data structures on disk or tape. The file may then be copied from disk to tape for redundancy. Finally, to restore the file, the storage management system finds the original data making up the file and sends that data back to a storage management client to put back on a workstation. So, a storage management system may be described as including disk and tape volumes onto which data is stored, and, possibly, a database used to track the location of data within the disk and tape volumes. Furthermore, it is common for many files from the same client, or for data from different client files, to be stored back-to-back on a single piece of media. For example, with common tape capacities well over 500 gigabytes (500 GB), it may take thousands of client files to fill a single tape. This increases the need for accurate and unique meta data to describe the client data.
Data deduplication describes a scenario in which common data is reduced to a single copy and redundant copies are replaced with pointers to the original copy. For example, a first file includes chunks (e.g., extents) x-z, which are stored. If a second file is divided into chunks (e.g., extents) a-h and chunks b and e (out of chunks a-h) are redundant (i.e., the same as chunks y and z in the first file), then chunks b and e are not stored again. Instead, pointers to y and z are stored. Thus, with data deduplication, redundant chunks are stored once.
Data deduplication can happen at file boundaries or sub-file boundaries with fingerprinting techniques available as prior art (e.g., a Rabin fingerprinting scheme may be described as a specific technique that produces sub-file boundaries of various lengths). In particular, data is broken down into chunks, and each chunk is given a signature that is intended to be unique. One example of a signature is a digest produced by a Secure Hash Algorithm (SHA). A SHA-1 digest (one version of the Secure Hash Algorithm) takes a chunk of data and digests it into a single 160-bit value. Variations on chunk size and the number of objects determine the relative possibility of a “false-positive” digest match.
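For illustration, a chunk signature of this kind can be computed with the hashlib module in the Python standard library. The function names chunk_signature and false_match_bound are hypothetical. The bound is only a simple union-bound estimate that assumes digests behave like uniformly random 160-bit values; it is not a statement about SHA-1's actual collision resistance.

```python
import hashlib


def chunk_signature(chunk):
    """Return the SHA-1 digest of a chunk as 40 hex characters (160 bits)."""
    return hashlib.sha1(chunk).hexdigest()


def false_match_bound(num_chunks):
    """Upper bound on the chance that any two of num_chunks distinct chunks
    share a digest, assuming digests are uniformly random 160-bit values."""
    pairs = num_chunks * (num_chunks - 1) / 2
    return pairs / 2 ** 160
```

Under that assumption, the bound grows roughly with the square of the number of chunks, which is one way to see why chunk size and object count affect the possibility of a false-positive match.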
Fingerprinting may be described as the process of looking at a sequence of bytes of arbitrary size and calculating a signature over a small window of those bytes. For example, assume that this window is 64 bytes. In this example, fingerprinting starts at offset 0 in the sequence, takes the first 64 bytes (offsets 0-63), and generates a signature. This value is logically “ANDed” with a mask to yield the low-order “n” bits of the signature. If this residual value matches a pre-determined search value, then it is determined that this data is significant, and a chunk of data is defined at this boundary. If the residual value does not match the search value, fingerprinting moves the window one byte and repeats the process (offsets 1-64 this time, 2-65 the time after that, etc.). The goal of fingerprinting is to break up a large piece of data into smaller chunks, where each chunk is then checked for redundancy. Based on mathematical probability, the average size of the chunks for completely random data will be approximately 2^n bytes, where “n” is the number of bits in the mask previously mentioned. Thus, to make the average chunk size larger, more bits are used in the mask and more bits in the search value. The larger the chunk size, the fewer chunks there are to be managed, but the less likely it is that mostly-common data will match. Likewise, the smaller the chunk size, the more likely it is that matches will be found, but there are more chunks to be managed.
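The windowing process described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the signature is a simple polynomial rolling hash standing in for a Rabin fingerprint, the constants BASE and MOD and the function name chunk_boundaries are hypothetical, and the window keeps sliding across a boundary rather than restarting after it.

```python
WINDOW = 64            # window size in bytes, as in the example above
BASE = 257             # assumed multiplier for the polynomial hash
MOD = (1 << 61) - 1    # assumed modulus (a Mersenne prime)


def chunk_boundaries(data, mask_bits=12, search_value=0):
    """Return the end offsets (exclusive) of the chunks found in data."""
    if not data:
        return []
    mask = (1 << mask_bits) - 1
    # Weight of the oldest byte in the window, used to remove it when sliding.
    top = pow(BASE, WINDOW - 1, MOD)
    boundaries = []
    signature = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the byte that falls out on the left.
            signature = (signature - data[i - WINDOW] * top) % MOD
        signature = (signature * BASE + byte) % MOD
        # Only test full windows; AND with the mask keeps the low-order n bits.
        if i + 1 >= WINDOW and (signature & mask) == search_value:
            boundaries.append(i + 1)
    if boundaries[-1:] != [len(data)]:
        boundaries.append(len(data))  # the final chunk ends with the data
    return boundaries
```

Because each boundary depends only on the 64 bytes that precede it, inserting a byte near the start of the data shifts later boundaries by one position instead of changing which content ends a chunk.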
Most common data deduplication techniques use a fingerprinting scheme to break data into smaller chunks and then calculate a digest against each chunk to determine if it has been seen before. In order to deduplicate data, most schemes:
1. Track the digest value of each chunk so that, as new data is chunked and digested, each new chunk can be checked for redundancy.
2. Track the various chunks of each piece of data being tracked in the system, so that when the data is requested by its owner, the chunks can be reconstructed into the original order and returned to the owner.
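The two kinds of tracking listed above can be sketched as a pair of in-memory maps. This is only an illustrative sketch: the class name DedupStore, its method names, and the use of SHA-1 digests as keys are assumptions made for the example, and a real system would keep this information in persistent storage.

```python
import hashlib


class DedupStore:
    def __init__(self):
        self.chunks_by_digest = {}  # digest -> chunk bytes, each stored once
        self.recipes = {}           # object name -> ordered list of digests

    def put(self, name, chunks):
        """Store an object given as an ordered list of chunks."""
        recipe = []
        for chunk in chunks:
            digest = hashlib.sha1(chunk).digest()
            # A redundant chunk is not stored again; only its digest
            # (acting as the pointer) is recorded in the recipe.
            self.chunks_by_digest.setdefault(digest, chunk)
            recipe.append(digest)
        self.recipes[name] = recipe

    def get(self, name):
        """Reassemble an object from its chunks in the original order."""
        return b"".join(self.chunks_by_digest[d] for d in self.recipes[name])
```

For example, storing a first object with chunks x, y, z and a second object with eight chunks of which two equal y and z leaves nine distinct chunks in chunks_by_digest, while get still returns each object in full.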
If a Storage Management System is storing its data into a deduplication system, the insertion of the meta data by the Storage Management System will greatly reduce the efficiency of the deduplication. The reason is that the meta data is distributed throughout the client file data and, thereby, reduces the likelihood of having common chunks. For example, two identical files of size 1 Megabyte (M) may not deduplicate much at all once the meta data is factored in. So, to increase deduplication characteristics, it is useful to not store the meta data with the file data, but not storing the meta data with the file data defeats the purpose of using meta data in the first place.
Alternatively, it is possible to separate the meta data from the file data and track the chunks independently. For example, if the Storage Management System were to separate file data from meta data, the Storage Management System may create a rudimentary database table that tracks each chunk and whether that chunk is file data or meta data. For example, each row in the table may have the following information:
Chunk id
Chunk digest value (for determining duplicates)
Chunk type (meta data or file data)
Chunk length
Chunk location (where the data is stored)
The idea is that each chunk, be it meta data or file data, is in the table, and the chunk id determines the order used to reconstruct the original data to send back.
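A minimal sketch of such a table, using a Python dataclass for each row, might look like the following. The names ChunkRow, add_row, and reconstruct are hypothetical, as is the choice to use the hex digest as the chunk location and a dictionary as the backing storage.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkRow:
    chunk_id: int     # also gives the order used for reconstruction
    digest: bytes     # used for determining duplicates
    chunk_type: str   # "meta" or "file"
    length: int
    location: str     # where the data is stored


def add_row(table, storage, chunk_id, chunk_type, payload):
    """Record a chunk in the table, storing its bytes only if they are new."""
    digest = hashlib.sha1(payload).digest()
    location = digest.hex()  # assumption: chunks are addressed by digest
    storage.setdefault(location, payload)
    table.append(ChunkRow(chunk_id, digest, chunk_type, len(payload), location))


def reconstruct(table, storage, include_meta=True):
    """Concatenate chunks in chunk id order, optionally skipping meta data."""
    rows = sorted(table, key=lambda row: row.chunk_id)
    return b"".join(
        storage[row.location]
        for row in rows
        if include_meta or row.chunk_type == "file"
    )
```

With include_meta set to False, the same table yields only the file data, which is what would be sent back to the client.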
The problem with this solution, however, is that the meta data chunks artificially segment the file data at undesired chunk boundaries. For example, assume that there are two 100M objects, and the second object is identical to the first object, with the exception of 10 different bytes at offset 100. If the meta data is inserted at discrete intervals, say every 128 kilobytes (128K), then this file will be chunked according to the fingerprinting technique, but also at meta data locations (i.e., every 128K in this example). In this example, assuming the first chunk extends up to byte 110, the first chunk will not match. But the data immediately before and after the meta data at each 128K boundary will be identical, so the remaining file data will be considered duplicate. However, instead of just changing those 10 bytes at offset 100, assume that some number of bytes are inserted or removed (a more likely scenario). Now the data immediately before and after the meta data chunks will not be the same, so the data will not be considered duplicate. As a result, an insertion of just one byte may cause the entire 100M to not match. Actually, the fingerprinting technique may find common data within a 128K section of each file (remember, the file data is segmented by the meta data), and this data will be deduplicated. But, if the average chunk size is 128K or higher, then this becomes less likely.
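The effect can be reproduced at a small scale with the following sketch. It is a deliberate simplification: the data is cut only at fixed intervals (standing in for the meta data locations), fingerprint-based boundaries inside each segment are ignored, and the sizes are scaled down to 1024-byte intervals over 8192 bytes. All names are hypothetical.

```python
import hashlib

INTERVAL = 1024  # scaled-down stand-in for the 128K meta data interval


def segment_digests(data, interval=INTERVAL):
    """Digest each fixed-interval segment of data."""
    return [
        hashlib.sha1(data[start:start + interval]).digest()
        for start in range(0, len(data), interval)
    ]


def count_duplicate_segments(original, modified, interval=INTERVAL):
    """Count segments of modified whose digest already exists for original."""
    known = set(segment_digests(original, interval))
    return sum(1 for d in segment_digests(modified, interval) if d in known)


# Sample object: 2048 four-byte counters, 8192 bytes, 8 segments.
original = b"".join(i.to_bytes(4, "big") for i in range(2048))

# Case 1: overwrite 10 bytes at offset 100; only the first segment changes.
overwritten = original[:100] + b"\xff" * 10 + original[110:]
print(count_duplicate_segments(original, overwritten))  # 7 of 8 segments

# Case 2: insert one byte at offset 100; every later segment is shifted.
inserted = original[:100] + b"\xff" + original[100:]
print(count_duplicate_segments(original, inserted))     # 0 of 9 segments
```

In the overwrite case all segments after the first are still found, whereas the single inserted byte shifts the content of every later segment relative to the fixed boundaries, so none of them match.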
Thus, there is a need in the art for improved data deduplication with embedded meta data.