Data de-duplication systems may parse an incident data stream into blocks. Data de-duplication systems may also compute an identifier for a block. The identifier may be referred to, for example, as a fingerprint. Data de-duplication systems may hash blocks to produce hash values that may serve as the identifier/fingerprint. Conventional data de-duplication approaches may parse a larger block of data into smaller blocks of data and then produce hopefully unique fingerprints for the blocks. The fingerprints are only “hopefully” unique because when the fingerprint is produced using a hash function there may be a possibility of a hash collision. In some conventional systems, parsing the larger block into smaller blocks may include finding block boundaries using a rolling hash. Unique blocks may be stored in a block repository.
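The parsing and fingerprinting steps described above can be illustrated with a minimal sketch. It assumes a simple polynomial rolling hash over a fixed window and SHA-256 as the fingerprint function; the names `parse_blocks` and `fingerprint` and the constants `WINDOW`, `MASK`, `BASE`, and `MOD` are hypothetical choices made for illustration and are not taken from any particular de-duplication system.

```python
import hashlib

WINDOW = 16    # bytes covered by the rolling hash (assumed value)
MASK = 0x3F    # a boundary is declared when the low 6 bits are zero (assumed value)
BASE = 257     # polynomial base for the rolling hash (assumed value)
MOD = 1 << 32  # keep the hash in 32 bits


def parse_blocks(stream: bytes) -> list[bytes]:
    """Split a stream into blocks at boundaries chosen by a rolling hash."""
    blocks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(stream):
        if i >= WINDOW:
            # Remove the oldest byte so the hash covers only the last WINDOW bytes.
            h = (h - stream[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # The boundary decision depends only on the window contents, not on
        # the absolute offset in the stream.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            blocks.append(stream[start:i + 1])
            start = i + 1
    if start < len(stream):
        blocks.append(stream[start:])  # trailing bytes form the last block
    return blocks


def fingerprint(block: bytes) -> str:
    """Block-wide hash used as an identifier; collisions remain possible."""
    return hashlib.sha256(block).hexdigest()
```

Because each boundary decision in this sketch depends only on the bytes inside the window, inserting data near the start of a stream tends to disturb only the first block or two, and later blocks keep their fingerprints.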
This general picture of conventional data de-duplication may be complicated by special cases encountered while ingesting an incident data stream, while producing output blocks or information about output blocks, while storing blocks, and at other times. One example complication involves an artificial maximum block size that may need to be imposed to prevent pathological behavior during ingest. Reaching a maximum block size may force a block boundary to be placed even though the rolling hash did not indicate a desired block boundary. Another complication arises when an incident data stream includes both metadata that is not to be de-duplicated and data that is to be de-duplicated. This complication may be handled by removing application metadata from the incident data stream before parsing the data stream. Yet another complication may arise when an incident data stream is presented out-of-order. Processing the out-of-order data stream may create potentially transient, sparse holes in an output data stream. Another complication may arise when data that has already been stored at a given stream offset is inadvertently overwritten.
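Two of the complications above, transient holes caused by out-of-order ingest and inadvertent overwrites of already-stored offsets, can be illustrated with a small bookkeeping sketch. The class name `StreamMap` and its methods are hypothetical; the sketch tracks only byte ranges, not the data itself, and it assumes that an overlapping write should be rejected rather than merged.

```python
class StreamMap:
    """Tracks which byte ranges of an output stream have been written."""

    def __init__(self):
        # Sorted, non-overlapping (start, end) pairs; end is exclusive.
        self.extents = []

    def write(self, offset: int, length: int) -> None:
        """Record a write, refusing to overwrite already-stored offsets."""
        start, end = offset, offset + length
        for s, e in self.extents:
            if start < e and s < end:
                raise ValueError(f"write at {offset} overlaps stored range {s}-{e}")
        self.extents.append((start, end))
        self.extents.sort()
        # Merge ranges that touch so the extent list stays compact.
        merged = []
        for s, e in self.extents:
            if merged and merged[-1][1] == s:
                merged[-1] = (merged[-1][0], e)
            else:
                merged.append((s, e))
        self.extents = merged

    def holes(self) -> list:
        """Return the gaps between offset 0 and the highest written offset."""
        gaps, cursor = [], 0
        for s, e in self.extents:
            if s > cursor:
                gaps.append((cursor, s))
            cursor = e
        return gaps
```

In this sketch a hole is transient in the sense that it disappears once the missing range arrives; a system might defer finalizing output blocks until `holes()` returns an empty list.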
Academic and theoretical discussions of de-duplication systems may not consider the realities of these complications and thus may not present a completely accurate picture of the disjoint, incomplete, and/or out-of-order outputs that may be present in data de-duplication systems and the additional processing performed to handle edge cases produced by these complications.
Fingerprinting a block may include performing a block-wide hash. Parsing a block into potentially smaller blocks and then fingerprinting the potentially smaller blocks using, for example, the MD5 hash, facilitates storing unique blocks and not storing duplicate blocks. Instead of storing duplicate blocks, smaller representations of stored blocks can be stored in file representations, object representations, and other data representations. Thus a de-duplication system ends up treating what was a single larger item (e.g., file, object) as a collection of smaller pieces (e.g., blocks). Additionally, de-duplication systems generally produce references (e.g., pointers) to the smaller pieces.
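The idea of treating an item as a collection of references to unique blocks can be sketched as follows. The sketch assumes fixed-size blocks for brevity (rather than boundaries found by a rolling hash) and uses the MD5 hash mentioned above as the fingerprint; `BlockStore`, `ingest`, and `restore` are hypothetical names introduced only for illustration.

```python
import hashlib


class BlockStore:
    """Repository that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes

    def put(self, block: bytes) -> str:
        fp = hashlib.md5(block).hexdigest()
        # A block whose fingerprint is already present is treated as a
        # duplicate. This is where a hash collision would go undetected.
        self.blocks.setdefault(fp, block)
        return fp


def ingest(store: BlockStore, item: bytes, block_size: int = 4) -> list:
    """Return the item's representation: an ordered list of fingerprints."""
    return [store.put(item[i:i + block_size])
            for i in range(0, len(item), block_size)]


def restore(store: BlockStore, recipe: list) -> bytes:
    """Recreate the original item by following the references in order."""
    return b"".join(store.blocks[fp] for fp in recipe)
```

For example, ingesting `b"ABCDABCDABCDWXYZ"` with a block size of 4 yields a list of four references while the store holds only two unique blocks.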
Conventional de-duplication systems achieve significant reductions in the storage footprint of data by storing just the unique blocks, storing the smaller representations of duplicate blocks, and generating organized collections (e.g., lists) of references (e.g., pointers) to sets of unique blocks needed to recreate an original item. To consume an even smaller data footprint, conventional de-duplication approaches may compress the potentially smaller blocks after they have been parsed out of the larger block. For security reasons, conventional de-duplication approaches may encrypt blocks and/or information about blocks.
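Compressing blocks after they have been parsed can be sketched as below. The sketch assumes that the fingerprint is computed over the uncompressed block, so duplicate detection does not depend on the compressor, and that `zlib` stands in for whatever compression a real system uses; `store_block` and `load_block` are hypothetical helper names. Encryption is omitted here.

```python
import hashlib
import zlib


def store_block(repo: dict, block: bytes) -> str:
    """Store a compressed copy of the block unless it is already present."""
    fp = hashlib.sha256(block).hexdigest()  # fingerprint of the raw block
    if fp not in repo:
        repo[fp] = zlib.compress(block)
    return fp


def load_block(repo: dict, fp: str) -> bytes:
    """Return the original, uncompressed block for a fingerprint."""
    return zlib.decompress(repo[fp])
```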
Once unique blocks and information about the unique blocks have been stored, it may be straightforward to recreate items represented by collections of unique blocks. However, data de-duplication systems may also synthesize new items from existing items. For example, a data de-duplication system that contains a full backup and a set of incremental backups may be able to create a synthetic full backup by splicing together pieces of the full backup, pieces of the incremental backups, and perhaps new data. Data de-duplication systems may create synthetic backups without reading and writing all the data needed for the backup. Instead of reading all the data from the various locations in which it is stored and then writing all the data to a single location, systems may create a synthetic backup by creating a new set of pointers or references to stored blocks. This may reduce and in some cases even eliminate reading and writing during synthetic backup creation.
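Creating a synthetic full backup from references alone can be sketched as follows. The sketch assumes that a full backup is represented as an ordered list of block references and that each incremental backup is a mapping from block index to the reference that replaces it; the function name `synthesize_full` and this representation are hypothetical simplifications.

```python
def synthesize_full(full_refs: list, incrementals: list) -> list:
    """Build a new list of references without reading any block data."""
    refs = list(full_refs)  # copy so the original full backup is untouched
    for inc in incrementals:  # apply incrementals from oldest to newest
        for index, ref in sorted(inc.items()):
            if index < len(refs):
                refs[index] = ref  # block changed since the previous backup
            elif index == len(refs):
                refs.append(ref)   # block appended to the end of the item
            else:
                raise ValueError(f"incremental leaves a hole before block {index}")
    return refs
```

Only references are copied and rewritten in this sketch; the stored blocks themselves are neither read nor written, which is the saving the paragraph above describes.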
While synthetic backups have been described academically and/or theoretically, the realities described above that complicate data de-duplication (e.g., out-of-order ingest, compression, encryption) may similarly complicate synthetic backups. While synthetic backups are described, more generally data de-duplication systems may support fabricating a new entity from components of existing entities. Even more generally, systems that treat data as a collection of smaller pieces and that include references to those smaller pieces may be susceptible to sub-optimal results due to the realities described above. The sub-optimal results may occur at the time of fabricating the new entity.