Technical Field
Aspects and embodiments relate to data storage, and more particularly to apparatus and methods for identifying redundant data stored in data storage.
Discussion
Given the costs associated with conventional tape libraries and other sorts of back-up storage media, storage system vendors often incorporate de-duplication processes into their product offerings to decrease the amount of required back-up media. De-duplication is a process of identifying repeating sequences of data and preventing or removing redundant storage of the repeating sequences of data. De-duplication is typically implemented as a function of a target device, such as a back-up storage device. Identifying redundant data within back-up data streams is a complex problem, and in the current state-of-the-art it is conventionally addressed using either hash fingerprinting or pattern recognition.
In hash fingerprinting, the incoming data stream first undergoes an alignment process (which attempts to predict good “breakpoints,” also known as edges, in the data stream that will provide the highest probability of subsequent matches) and is then subjected to a hashing process (usually SHA-1 or SHA-2 in the current state-of-the-art). The data stream is broken into chunks (usually about 8 to 12 kilobytes in size) by the hashing process, and each chunk is assigned its resultant hash value. This hash value is compared against a memory-resident table. If the hash entry is found, the data is assumed to be redundant and is replaced with a pointer to the existing block of data already stored in a disk storage system; the location of the existing data is given in the table. If the hash entry is not found, the data is stored in a disk storage system and its location is recorded in the memory-resident table along with its hash. Some examples that illustrate this mechanism can be found in U.S. Pat. No. 7,065,619 assigned to Data Domain and U.S. Pat. No. 5,990,810 assigned to Quantum Corporation. Hash fingerprinting is typically executed in-line; that is, data is processed in real time prior to being written to disk.
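By way of illustration only, the hash fingerprinting flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the mechanism of any cited patent or product: the names find_chunks and DedupStore, the toy shift-and-add breakpoint hash, the mask and size limits, and the in-memory list standing in for a disk storage system are all hypothetical, and SHA-256 is used as one member of the SHA-2 family.

```python
import hashlib


def find_chunks(data, mask=0x1FFF, min_size=2048, max_size=16384):
    """Toy alignment step: cut where a simple hash of recent bytes hits a pattern.

    Hypothetical parameters; a 13-bit mask gives chunks of very roughly
    8 KiB on random-looking data.
    """
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shift-and-add hash; older bytes shift out of the 32-bit value.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


class DedupStore:
    """In-memory stand-in for the memory-resident table plus disk storage."""

    def __init__(self):
        self.table = {}   # chunk hash -> location of the stored chunk
        self.blocks = []  # stand-in for the disk storage system

    def write(self, data):
        """Store a stream; return the list of chunk locations (pointers)."""
        pointers = []
        for chunk in find_chunks(data):
            digest = hashlib.sha256(chunk).digest()
            location = self.table.get(digest)
            if location is None:
                # Hash not found: store the chunk and record its location.
                location = len(self.blocks)
                self.blocks.append(chunk)
                self.table[digest] = location
            # Hash found (or just added): keep only a pointer to the chunk.
            pointers.append(location)
        return pointers

    def read(self, pointers):
        """Rebuild a stream from its chunk pointers."""
        return b"".join(self.blocks[loc] for loc in pointers)
```

In this sketch, writing the same stream twice with DedupStore.write leaves the number of stored blocks unchanged after the second call, since every chunk hash is already present in the table, and DedupStore.read reproduces the original stream from the returned pointers.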
In pattern recognition, the incoming data stream is first “chunked,” or segmented, into relatively large data blocks (on the order of about 32 MB). Each block is then processed by a simple rolling hash method whereby a list of hash values is assembled. A transformation is applied to the hash values so that a resulting small list of values represents a data block “fingerprint.” A table of hashes is then searched to determine whether at least a certain number of the fingerprint hashes are found in any previously stored block. If the minimum number of matches is not met, the block is considered unique and is stored directly to disk, and the corresponding fingerprint hashes are added to a memory-resident table. If the minimum number of matches is met, there is a probability that the current data block matches a previously stored data block. In this case, the block of disk storage associated with a matching fingerprint is read into memory and compared byte-for-byte against the candidate block that had been hashed. If the full sequence of data is equal, the data block is replaced by a pointer to the physically addressed block of storage. If the full block does not match, a mechanism that detects changed portions within the block is employed to determine a minimal data set within the block that needs to be stored. The result is a combination of unique data plus references to a closely matching block of previously stored data. An example that illustrates this mechanism can be found in U.S. Patent Application US2006/0059207 assigned to Diligent Corporation. As with hash fingerprinting above, pattern recognition is typically executed in-line.
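By way of illustration only, the pattern recognition flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the mechanism of the cited application: the names rolling_hashes, fingerprint, make_delta and SimilarityStore are hypothetical, the fingerprint transformation is assumed to be "keep the k smallest distinct rolling hash values," the changed-portion detection is delegated to Python's difflib, and only blocks stored whole are indexed. Real blocks on the order of 32 MB would need a far more efficient difference mechanism than the one used here.

```python
from collections import Counter
from difflib import SequenceMatcher


def rolling_hashes(block, window=16, base=257, mod=(1 << 31) - 1):
    """Polynomial rolling hash of every `window`-byte span in the block."""
    if len(block) < window:
        return []
    high = pow(base, window - 1, mod)  # weight of the byte leaving the window
    h = 0
    for b in block[:window]:
        h = (h * base + b) % mod
    hashes = [h]
    for i in range(window, len(block)):
        h = ((h - block[i - window] * high) * base + block[i]) % mod
        hashes.append(h)
    return hashes


def fingerprint(block, k=8):
    """Assumed transformation: the k smallest distinct rolling hash values."""
    return sorted(set(rolling_hashes(block)))[:k]


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal (unique) data."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))
    return ops


class SimilarityStore:
    def __init__(self, min_matches=4):
        self.min_matches = min_matches
        self.index = {}    # fingerprint hash -> id of a block stored whole
        self.entries = []  # ("raw", bytes) | ("ref", id) | ("delta", id, ops)

    def write_block(self, block):
        fp = fingerprint(block)
        votes = Counter(self.index[h] for h in fp if h in self.index)
        entry = None
        if votes:
            candidate, count = votes.most_common(1)[0]
            if count >= self.min_matches:
                stored = self.entries[candidate][1]  # "read from disk"
                if stored == block:                  # byte-for-byte compare
                    entry = ("ref", candidate)
                else:
                    entry = ("delta", candidate, make_delta(stored, block))
        if entry is None:
            # Too few matches: treat as unique and index its fingerprint.
            entry = ("raw", block)
            for h in fp:
                self.index.setdefault(h, len(self.entries))
        self.entries.append(entry)
        return len(self.entries) - 1

    def read_block(self, block_id):
        entry = self.entries[block_id]
        if entry[0] == "raw":
            return entry[1]
        base = self.entries[entry[1]][1]
        if entry[0] == "ref":
            return base
        return b"".join(base[op[1]:op[2]] if op[0] == "copy" else op[1]
                        for op in entry[2])
```

In this sketch, a block written a second time is recorded as a ("ref", id) entry, while a block that shares enough fingerprint hashes with a stored block but differs in content is recorded as a ("delta", id, ops) entry holding only the changed bytes plus copy ranges into the stored block.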