Data can be stored and data can be transmitted. Storing data takes time and space while transmitting data takes time and bandwidth. Both storing and transmitting data cost money. Yet more and more data is being generated every day. Indeed, the rate at which the amount of data is expanding may be exceeding the rate at which storage space and transmission bandwidth are growing. Furthermore, while the amount of data to be stored and/or transmitted is growing, the amount of time available to store and/or transmit data remains constant. Therefore, efforts have been made to reduce the time, space, and bandwidth required to store and/or transmit data. These efforts are referred to as data reduction.
Data reduction includes data deduplication, data protection, and data management. Data deduplication may be referred to as “dedupe” or “deduping”. Dedupe for data storage initially relied on the fact that a larger piece of data can be represented by a smaller piece of data called a fingerprint. The fingerprint can be, for example, a hash. By way of illustration, a 1 kilobyte (KB, 1×10^3 bytes) block of data may be uniquely identified by a 128 bit cryptographic hash. Sophisticated techniques for computing hashes of different widths and with different security have been developed. These techniques have been applied to fixed and variable sized blocks of data.
Dedupe for data storage also relied on the fact that much data that is stored has already been stored. If data has already been stored, then it does not have to be stored again. Instead of storing a copy of a block of data that is already stored, a record that identifies and facilitates locating the previously stored block can be stored. The record may be significantly smaller than the block of data. The record can include the fingerprint hash and other information. Data reduction involves both breaking a larger piece of data into smaller pieces of data, which can be referred to as “chunking”, and producing the unique identifier, which can be performed by hashing. Reaping the rewards of clever chunking and hashing depends on being able to accurately and efficiently locate chunks of data and hashes for the chunks of data.
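By way of illustration only, the chunking, hashing, and record-keeping described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the fixed 1,024 byte chunk size, the choice of BLAKE2b truncated to 128 bits as the fingerprint, and the names `DedupeStore`, `put`, and `get` are all hypothetical and do not come from the description above.

```python
import hashlib

CHUNK_SIZE = 1024  # assumed fixed chunk size, for illustration only


def fingerprint(chunk: bytes) -> str:
    """Return a 128 bit fingerprint of a chunk as 32 hex digits.

    BLAKE2b with a 16 byte digest is an assumption made for this sketch;
    any hash of suitable width could be substituted.
    """
    return hashlib.blake2b(chunk, digest_size=16).hexdigest()


class DedupeStore:
    """Hypothetical store that keeps each unique chunk only once."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, data: bytes) -> list:
        """Chunk the data, store unseen chunks, and return a list of fingerprints.

        The returned list is the small record from which the data can be
        recreated. The sketch treats equal fingerprints as equal chunks
        and does not handle hash collisions.
        """
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = fingerprint(chunk)
            if fp not in self.chunks:  # store only chunks not already stored
                self.chunks[fp] = chunk
            recipe.append(fp)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Recreate the original data from its list of fingerprints."""
        return b"".join(self.chunks[fp] for fp in recipe)
```

In this sketch, data made of three chunks of which two are identical stores only two chunks, while the record returned by `put` still lists three fingerprints so that `get` can rebuild the data.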
Conventionally, determining whether a chunk of data was a duplicate included comparing chunks of data byte-by-byte. To compare a chunk, it had to be accessible, which may have required an input/output (I/O) operation. I/O operations can be a bottleneck in certain types of processing. After dedupe chunking and hashing, determining whether a chunk of data is a duplicate could include comparing fingerprints (e.g., hashes) instead of comparing chunks of data byte-by-byte. Comparing 128 bit hashes, or other sized hashes, can be more efficient than comparing chunks (e.g., 1 KB, 128 KB) of data byte-by-byte. Just as comparing blocks requires having the blocks available, comparing fingerprints requires having the fingerprints available. Having a fingerprint available involves finding the fingerprint, and finding a fingerprint depends on intelligent indexing. However, indexing should not consume so much additional memory that an inordinate amount of the space saved through data reduction is spent on indexing.
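The concern about index size can be made concrete with the figures used above. The following is a rough, hedged estimate only: the function name `index_overhead` and the `extra_bytes` parameter (standing in for location information or other per-entry data) are hypothetical, and a real index would carry additional structural overhead that this sketch ignores.

```python
def index_overhead(chunk_bytes: int, fingerprint_bits: int = 128,
                   extra_bytes: int = 0) -> float:
    """Return index bytes per stored chunk byte for one index entry per chunk.

    extra_bytes is an assumed allowance for whatever else an entry holds
    (e.g., where the chunk is stored); it defaults to zero.
    """
    entry_bytes = fingerprint_bits // 8 + extra_bytes
    return entry_bytes / chunk_bytes


# A 128 bit (16 byte) fingerprint per 1 KB (1,000 byte) chunk
small_chunks = index_overhead(1_000)

# The same fingerprint per 128 KB (128,000 byte) chunk
large_chunks = index_overhead(128_000)
```

Under these assumptions the fingerprint alone adds 16 bytes per 1,000 byte chunk (1.6%) and 16 bytes per 128,000 byte chunk (0.0125%), which shows how the chunk size chosen affects the share of saved space that is spent on indexing.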
Data reduction for data transmission also initially relied on the fact that a large piece of data that can be represented by its smaller fingerprint can, in effect, be transmitted by transmitting the fingerprint to a system that already has the large piece of data and a way to correlate the fingerprint to the large piece of data. The correlation has typically been made by storing fingerprints in a single global index, which may have negatively impacted searching and retrieving fingerprints.
Dedupe can be applied to aggregations of data (e.g., files) that can be partitioned into smaller parts (e.g., chunks, sub-chunks). The smaller parts may be arranged into a hierarchy. A hash can be computed and stored for each unique block seen by a dedupe process or apparatus. Determining whether to store and/or transmit a just processed chunk depends on determining whether the chunk has been previously hashed, stored, and/or transmitted. If the chunk has already been stored, then to store the aggregation there is no need to store the duplicate chunk again. There is only a need to record, with the structure from which the aggregation will be recreated, the fact that the chunk is stored and where the chunk is stored. If the chunk has already been transmitted, and if the chunk is stored at the receiving site, then there is no need to transmit the whole chunk again. Uniquely identifying information (e.g., fingerprint hash) can be transmitted instead, thereby reducing the amount of data transmitted.
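By way of illustration only, the transmission case described above can be sketched as follows. This is a minimal sketch under stated assumptions: the fixed 1,024 byte chunk size, the 128 bit BLAKE2b fingerprint, the function name `send_deduped`, and the use of an in-memory dictionary to stand in for the chunks held at the receiving site are all hypothetical. An actual sender would have to learn which fingerprints the receiver holds through some exchange that is not modeled here.

```python
import hashlib

CHUNK_SIZE = 1024  # assumed fixed chunk size, for illustration only


def send_deduped(data: bytes, receiver_chunks: dict):
    """Simulate transmitting data to a receiver that may already hold chunks.

    receiver_chunks maps fingerprint -> chunk and stands in for the
    receiving site. Returns (recipe, bytes_sent), where recipe is the
    ordered list of fingerprints from which the receiver rebuilds the data.
    """
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp = hashlib.blake2b(chunk, digest_size=16).digest()
        bytes_sent += len(fp)  # the fingerprint is always transmitted
        if fp not in receiver_chunks:
            # The receiver lacks this chunk, so the whole chunk is sent too.
            receiver_chunks[fp] = chunk
            bytes_sent += len(chunk)
        recipe.append(fp)
    return recipe, bytes_sent


# First version: four distinct chunks, none yet held by the receiver.
receiver = {}
version1 = b"a" * 1024 + b"b" * 1024 + b"c" * 1024 + b"d" * 1024
recipe1, sent1 = send_deduped(version1, receiver)

# Second version: only the second chunk differs from the first version.
version2 = version1[:1024] + b"z" * 1024 + version1[2048:]
recipe2, sent2 = send_deduped(version2, receiver)
rebuilt2 = b"".join(receiver[fp] for fp in recipe2)
```

In this sketch the first version costs four fingerprints plus four full chunks, while the second version costs four fingerprints plus only the one changed chunk, and the receiver can still rebuild the second version in full from its recipe.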
Determining whether a chunk has been previously hashed, stored, and/or transmitted involves comparing hashes. Efficiently comparing hashes involves efficiently finding hashes, which involves prior attention to indexing. As data sets grow ever larger, as work becomes ever more distributed, and as large files are more routinely stored, transmitted, and archived, issues associated with indexing hashes become larger and larger.
Consider the following transmission and storage nightmare scenario. Imagine twenty people are all collaborating on authoring and editing a scientific paper that has one hundred pages of text, twenty embedded formulae, ten embedded photographs, two embedded video segments, and fifteen slides. The entire paper is fifty megabytes (MB) in size. The paper is attached to an email and distributed to the entire group. If the entire file is actually transmitted to each person, then 20×50 MB of data (1,000 MB, 1 GB) is transmitted. If each person stores the 50 MB file, then 1 GB of data is stored, just for the twenty copies of the identical file.
Now consider that different members of the group will mark up different versions of the paper, save their marked up copy, and then email the marked up copy to individuals and/or to the entire group. This can happen over and over. Some of the marked up versions may only change the text in one controversial sentence, while other marked up versions may merely substitute one photograph for another photograph. As the deadline for publishing the paper draws near, the paper may be transmitted back and forth several times a day. Conventionally this would consume many gigabytes of transmission bandwidth and many gigabytes of storage. Now assume that on the day before publication the final version of the paper and all its predecessors are distributed to an entire organization having 1,000 people. The bandwidth required to transmit these mostly duplicate versions is enormous. The memory consumed storing the emails is also enormous. And yet most of the transmission bandwidth and most of the memory is consumed by duplicates. While the versions of the document may not be identical, substantial portions of the document may be unchanged from the original, and substantial portions may have become settled after one or two rounds of editing. The storage issue is multiplied when the emails are backed up or archived.
Data reduction facilitates reducing the transmission bandwidth and the storage space consumed, whether the storage is primary, near-line, or archival. However, data reduction depends on determining whether all or some of an object that is being considered for transmission, storage, and/or archiving duplicates all and/or some of a previously transmitted, stored, and/or archived object. The decision concerning whether an object is a duplicate may be made inline when considering transmission and may be made during post processing when considering how to reduce the amount of data stored and/or archived. Regardless of when the decision is made, at some point a determination needs to be made concerning whether an object is a duplicate. Intelligent indexing of hashes and other information associated with stored data facilitates efficiently deciding whether an object is a duplicate.