Data de-duplication systems continue to practice new methods for identifying duplicate blocklets of data. These methods share the property that either incoming blocklets or information about incoming blocklets is compared to stored blocklets or information about stored blocklets to determine whether an incoming blocklet is unique or is a duplicate. While impressive gains have been made in duplicate determinations, which have led to improved efficiency in data reduction, additional improvements may be desired.
Simple patterns may appear in data. For example, a document may be padded with a run of space characters while a data stream may include a long run of all-zero bytes. Simple patterns may include contiguous runs of repeating single characters (e.g., AAAAAA . . . A), contiguous runs of repeating pairs of characters (e.g., ABABAB . . . AB), or contiguous runs of even larger repeating groups of characters (e.g., ABCDABCDABCD . . . ABCD). While characters are described, more generally the repeating item may be a value (e.g., bit, byte). In photographs there may be long runs of repeating codes associated with a color (e.g., sky blue) that appears frequently in the photograph. Depending on the type of data, different patterns may be common. For example, sparse files may be padded with all-zero patterns.
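The kinds of simple patterns described above can be illustrated with a short sketch. The function name `smallest_repeating_unit` is hypothetical and is introduced here only for illustration. The sketch assumes the whole buffer consists of complete repetitions of one group of bytes, and it uses a well-known string-doubling trick rather than any method described in this document.

```python
from typing import Optional


def smallest_repeating_unit(data: bytes) -> Optional[bytes]:
    """Return the shortest group of bytes whose repetition forms `data`.

    Returns None when `data` is empty or is not made of two or more
    complete repetitions of a shorter group.
    """
    if not data:
        return None
    # If data is built from repetitions of a shorter unit, then data
    # reappears inside data + data at an offset equal to the unit length.
    offset = (data + data).find(data, 1)
    if offset < len(data):
        return data[:offset]
    return None


print(smallest_repeating_unit(b"AAAAAA"))        # b'A'
print(smallest_repeating_unit(b"ABABAB"))        # b'AB'
print(smallest_repeating_unit(b"ABCDABCDABCD"))  # b'ABCD'
print(smallest_repeating_unit(b"ABCDE"))         # None
```

The same check applies to any byte values, so a run of all-zero bytes is reported as a repetition of the single byte `b"\x00"`.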
Data compression and data de-duplication are both concerned with reducing the space required to store data. One well-known data compression algorithm, run-length encoding, detects long runs of characters using a byte-wise scan and then replaces each long run with, for example, an identifier and a count. Unfortunately, performing byte-wise scans can be computationally expensive.
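A minimal sketch of run-length encoding follows. The names `rle_encode` and `rle_decode` are hypothetical, and the (value, count) pair layout is an assumption chosen for readability rather than a format taken from this document. Note that the encoder visits every byte of its input, which is the byte-wise scan whose cost is noted above.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def rle_encode(data: bytes) -> List[Tuple[int, int]]:
    """Scan data byte by byte and collapse each run into (value, count)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(data)]


def rle_decode(pairs: Iterable[Tuple[int, int]]) -> bytes:
    """Expand (value, count) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for value, count in pairs)


encoded = rle_encode(b"AAAAAABBB")
print(encoded)              # [(65, 6), (66, 3)]
print(rle_decode(encoded))  # b'AAAAAABBB'
```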
Conventional data de-duplication approaches may parse a larger block of data into smaller blocklets of data and then produce hopefully unique fingerprints for the blocklets. The fingerprints are only “hopefully” unique because, when a fingerprint is produced using a hash function, a hash collision is possible. In some conventional systems, parsing the larger block into smaller blocklets may include finding blocklet boundaries using a rolling hash. In some examples, the presence of a repeating pattern (e.g., long run of zeroes) makes it less likely that the rolling hash will indicate a boundary and more likely that a maximum blocklet size will be reached. A maximum blocklet size is typically imposed to prevent pathological behavior in a data de-duplication system. Reaching a maximum blocklet size may force a blocklet boundary to be placed even though the rolling hash did not indicate a desired blocklet boundary. The presence of repeating patterns in the block may lead to low data entropy. The lower the entropy of the data, the less likely it is that a conventional rolling hash will find a boundary in the data and the more likely it is that a maximum sized blocklet will be produced. “Entropy”, as used herein, refers to a measure of uncertainty associated with the randomness of data in an object. On a scale normalized to the range of zero to one, the entropy of data that is truly random is one and the entropy of a long string of duplicate characters is nearly zero. The entropy of most data falls between these two limiting examples.
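The interaction between a rolling hash, a maximum blocklet size, and low-entropy data can be sketched as follows. Everything in this sketch is an assumption made for illustration: the polynomial rolling hash, the constants (`WINDOW`, `BASE`, `MASK`, `MIN_SIZE`, `MAX_SIZE`), the boundary test, and the function names are hypothetical and are not taken from any particular de-duplication system. Entropy is computed here as the Shannon entropy of the byte histogram divided by eight bits, which is one possible way to obtain a value between zero and one.

```python
import math
from collections import Counter
from typing import List

WINDOW = 16                        # bytes covered by the rolling hash
BASE = 257                         # polynomial base (assumed)
MOD = 1 << 32                      # keep the hash in 32 bits
OUT_FACTOR = pow(BASE, WINDOW, MOD)
MASK = (1 << 6) - 1                # boundary when the low 6 bits are all ones
MIN_SIZE = 16                      # smallest blocklet the hash may produce
MAX_SIZE = 256                     # forced boundary at this length


def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, scaled to the range 0..1."""
    if not data:
        return 0.0
    n = len(data)
    bits = sum(-(c / n) * math.log2(c / n) for c in Counter(data).values())
    return bits / 8.0


def split_blocklets(data: bytes) -> List[bytes]:
    """Split data at rolling-hash boundaries, capped at MAX_SIZE bytes."""
    blocklets = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Slide the window: add the new byte, drop the byte that left.
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT_FACTOR) % MOD
        length = i + 1 - start
        hash_boundary = length >= MIN_SIZE and (h & MASK) == MASK
        if hash_boundary or length >= MAX_SIZE:
            blocklets.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocklets.append(data[start:])
    return blocklets


zeros = bytes(1024)
print(normalized_entropy(zeros))                 # 0.0
print([len(b) for b in split_blocklets(zeros)])  # [256, 256, 256, 256]
```

For the all-zero input, the rolling hash in this sketch stays at zero, so the boundary test never succeeds and every blocklet is cut at the forced maximum size. With the constants assumed here, input with higher entropy would generally be expected to reach a hash-selected boundary before the maximum size is reached.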
Fingerprinting a blocklet may include performing a blocklet-wide hash. One blocklet-wide hash that has been used is an MD5 (Message Digest Algorithm 5) hash. Parsing a block into blocklets and then fingerprinting the blocklets using, for example, the MD5 hash, facilitates storing unique blocklets and not storing duplicate blocklets. Instead of storing duplicate blocklets, smaller representations of stored blocklets can be stored in file representations, object representations, and other data representations. Conventional de-duplication systems already achieve significant reductions in the storage footprint of data, including pattern data, by storing just the unique blocklets and storing the smaller representations of duplicate blocklets. To consume an even smaller data footprint, conventional de-duplication approaches may compress blocklets after they have been parsed out of the larger block. However, the compression may once again include a computationally expensive byte-wise scan that looks for opportunities to perform run-length encoding.
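Fingerprint-based storage of unique blocklets can be sketched with the MD5 implementation in the Python standard library. The class name `BlockletStore`, its methods, and the idea of returning a list of fingerprints as the smaller representation are hypothetical choices made for illustration. The sketch also assumes that equal fingerprints imply equal blocklets, which ignores the possibility of a hash collision.

```python
import hashlib
from typing import Dict, Iterable, List


class BlockletStore:
    """Keeps one copy of each unique blocklet, keyed by MD5 fingerprint."""

    def __init__(self) -> None:
        self.blocklets: Dict[bytes, bytes] = {}

    def add(self, blocklets: Iterable[bytes]) -> List[bytes]:
        """Store unseen blocklets; return the fingerprints in input order."""
        recipe = []
        for blocklet in blocklets:
            fingerprint = hashlib.md5(blocklet).digest()
            # A duplicate costs only its 16-byte fingerprint in the recipe.
            self.blocklets.setdefault(fingerprint, blocklet)
            recipe.append(fingerprint)
        return recipe

    def rebuild(self, recipe: Iterable[bytes]) -> bytes:
        """Reassemble the original data from a list of fingerprints."""
        return b"".join(self.blocklets[fp] for fp in recipe)


store = BlockletStore()
recipe = store.add([bytes(256), b"unique data", bytes(256), bytes(256)])
print(len(recipe))           # 4
print(len(store.blocklets))  # 2
```

In this usage, three identical all-zero blocklets are stored once, and the recipe records where each occurrence belongs so that `rebuild` can reproduce the original data.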
Identifying a contiguous run of repeating characters provides an opportunity to perform compression using, for example, run-length encoding. Identifying a run of repeating characters may also provide other opportunities, for example, for determining the starting or ending location of a sparse region of a file. However, as described above, conventional systems tend to find these contiguous runs of repeating characters either by performing a computationally expensive byte-wise scan or by comparing a received blocklet to a stored blocklet. Greater efficiencies are desired.