Data de-duplication systems generally include a repository of unique blocklets, an index or indexes for accessing the repository of unique blocklets, and representations of items made up from the unique blocklets. Data de-duplication systems also generally include processes for ingesting received data and for determining whether the ingested data includes new unique blocklets and/or duplicate (e.g., already stored) blocklets. If the ingested data is not found in the repository, then the ingested data may be added to the repository and the index may be updated with information about the ingested data that was added to the repository. If the ingested data is found in the repository, then a reference to the stored copy of the data may be used to refer to the data and the ingested data may be discarded. Naturally, when a system is first put into use, practically all ingested data will be new data and little, if any, de-duplication will occur. While data ingestion is described, similar issues arise in a replication environment where data from a first location or device is being replicated at a second location or device.
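The ingest decision described above can be illustrated with a minimal sketch. The class name `BlockletStore`, its method names, the in-memory list and dictionary, and the use of SHA-256 as the lookup key are all assumptions made for illustration; they are not taken from any particular de-duplication product.

```python
import hashlib


class BlockletStore:
    """Hypothetical in-memory repository, index, and item representation."""

    def __init__(self):
        self.repository = []  # unique blocklet payloads, in arrival order
        self.index = {}       # fingerprint -> position in the repository

    def ingest(self, blocklet: bytes) -> int:
        """Return a reference to the stored copy, adding it if it is new."""
        # SHA-256 is an assumed choice of identifier for this sketch.
        key = hashlib.sha256(blocklet).hexdigest()
        ref = self.index.get(key)
        if ref is None:
            # Not found: add to the repository and update the index.
            ref = len(self.repository)
            self.repository.append(blocklet)
            self.index[key] = ref
        # Found (or just added): the caller keeps only the reference.
        return ref

    def ingest_item(self, blocklets):
        """Represent an item as a list of references to unique blocklets."""
        return [self.ingest(b) for b in blocklets]

    def restore(self, representation):
        """Rebuild the original item from its list of references."""
        return b"".join(self.repository[ref] for ref in representation)
```

In this sketch, ingesting the blocklets `b"aaa"`, `b"bbb"`, `b"aaa"` into an empty store would keep two unique blocklets and represent the item with three references, the third pointing back at the first stored copy.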
During ingest, data de-duplication systems typically parse larger blocks of data into smaller blocklets of data and then populate the repository with unique blocklets and populate the index(es) used to access the repository with information about the unique blocklets. In some conventional systems, parsing larger blocks into smaller blocklets may include finding blocklet boundaries using a rolling hash and making duplicate determinations for every parsed blocklet. The duplicate determination may include producing an identifier (e.g., fingerprint) for a blocklet. The identifier may be, for example, a blocklet-wide hash (e.g., MD5 (Message Digest Algorithm 5)). Parsing a block into blocklets and then fingerprinting the blocklets using, for example, the MD5 hash, facilitates storing unique blocklets and not storing duplicate blocklets. Instead of storing duplicate blocklets, smaller representations of stored blocklets can be stored in file representations, object representations, and other data representations. When a system is new or relatively immature, practically every piece of data will be treated as a unique blocklet.
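A rough sketch of rolling-hash boundary detection followed by MD5 fingerprinting is given below. The window size, hash base, modulus, boundary mask, and minimum and maximum blocklet lengths are illustrative assumptions, as are the function names `split_blocklets` and `fingerprint`; a production system would likely choose different parameters and a different rolling hash.

```python
import hashlib

# All constants below are assumed values chosen only for illustration.
WINDOW = 16              # bytes covered by the rolling hash
BASE = 257               # polynomial base
MOD = (1 << 31) - 1      # modulus for the polynomial hash
MASK = (1 << 6) - 1      # boundary when the low 6 bits of the hash are zero
MIN_LEN, MAX_LEN = 32, 256


def split_blocklets(data: bytes):
    """Split data into blocklets at positions chosen by a rolling hash."""
    blocklets = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        length = i - start + 1
        if length > WINDOW:
            # Roll the window forward: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        at_boundary = length >= MIN_LEN and (h & MASK) == 0
        if at_boundary or length >= MAX_LEN:
            blocklets.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        blocklets.append(data[start:])  # trailing partial blocklet
    return blocklets


def fingerprint(blocklet: bytes) -> str:
    """Blocklet-wide MD5 hash used as the blocklet identifier."""
    return hashlib.md5(blocklet).hexdigest()
```

Because a boundary depends only on the bytes inside the window, an insertion early in a block tends to disturb only nearby boundaries, which is one reason content-defined boundaries are often preferred over fixed-size splits.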
Conventional de-duplication systems already achieve significant reductions in the storage footprint of data, including pattern data, by storing just the unique blocklets and storing the smaller representations of duplicate blocklets. However, these significant reductions may only occur after a break-in period during which the repository of unique blocklets is built up. In addition to reducing the storage footprint for data, de-duplication systems may also be used to reduce the amount of data that is transmitted between devices (e.g., computers). De-duplication systems may be used to reduce data traffic to just unique data and information about that unique data. Conventional de-duplication systems already achieve significant reductions in the transmission footprint of data by making it possible to transmit only unique data from one location to another location.
One issue may arise because conventional data de-duplication systems may all “start from scratch”. Thus, the index, the repository, or both may initially be empty. Rather than being “empty”, a repository and index may instead be in a sub-optimal state when the repository is immature, when data being stored or transmitted is significantly different from the data that has been previously processed, when the working set of data being processed is too large for the system to handle effectively and thus ‘older’ data becomes more expensive or impossible to de-duplicate against, or for other reasons. When an index has less than complete knowledge of the blocklets in the repository, then using the index may be very expensive with little, if any, return on the investment. Similarly, when a repository has few relevant unique blocklets, then looking for duplicate blocklets in the repository may also be very expensive with little return on the investment.
Whether being used to reduce the amount of data stored or to reduce the amount of data transmitted, de-duplication may be relatively ineffective and even counter-productive until a relevant reference pool is built up in the repository or repositories and until knowledge about the relevant reference pool is acquired and made accessible. Unfortunately, filling a repository to a useful level may be expensive in terms of bits transmitted across a network, processor time spent analyzing blocklets, processor time spent populating an index, and other actions. Thus, it may be a difficult decision to add data de-duplication to a computing environment. Compounding the difficulty of the decision is the fact that different systems and different applications may have different break-even points and costs. For example, a de-duplication system that includes a repository in “the cloud” may be characterized by a high latency link to the repository, billable or expensive processor time, and billable or expensive memory usage, and thus may have a first break-even point determined by these characteristics. In another example, a de-duplication system that has extensive storage optimized for the repository may be characterized by a low latency link to the repository, non-billable or very inexpensive processor time, and non-billable or very inexpensive memory, and thus may have a second, different break-even point. Thus, it may be difficult to predict when, if ever, a break-even point will be reached. All of these issues present barriers to entry for adopting data de-duplication systems.
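The notion of a break-even point can be made concrete with a deliberately simplified sketch. The cost model here is an assumption made only for illustration: every ingested blocklet incurs a fixed overhead, every duplicate found yields a fixed saving, and the fraction of duplicates per batch is supplied by the caller. The function name `first_break_even` and all of its parameters are hypothetical, and the figures used in the usage note are invented rather than measured.

```python
def first_break_even(overhead_per_blocklet, saving_per_duplicate,
                     duplicate_ratio_by_batch, batch_size):
    """Return the 1-based batch number at which cumulative savings first
    meet or exceed cumulative overhead, or None if that never happens
    within the batches supplied."""
    cumulative_overhead = 0.0
    cumulative_saving = 0.0
    for batch_number, ratio in enumerate(duplicate_ratio_by_batch, start=1):
        # Every blocklet pays the analysis, index, and link overhead.
        cumulative_overhead += overhead_per_blocklet * batch_size
        # Only blocklets found to be duplicates produce a saving.
        cumulative_saving += saving_per_duplicate * ratio * batch_size
        if cumulative_saving >= cumulative_overhead:
            return batch_number
    return None
```

Under these assumptions, with duplicate ratios of 0.0, 0.25, 0.5, and 0.75 over four batches of 100 blocklets and a saving of 2.0 per duplicate, a system with an overhead of 0.5 per blocklet would break even at the third batch, while a system with an overhead of 2.0 per blocklet would not break even within the four batches at all.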