Data can be stored and data can be transmitted. Storing data takes time and space, while transmitting data takes time and bandwidth. Both storing and transmitting data cost money. Yet more and more data is generated every day. Indeed, the rate at which data is generated may exceed the rate at which storage space and transmission bandwidth are growing. Furthermore, while the amount of data to be stored and/or transmitted is growing, the amount of time available to store and/or transmit it remains constant. Therefore, efforts have been made to reduce the time, space, and bandwidth required to store and/or transmit data. These efforts are referred to as data reduction. Data reduction includes data deduplication, data protection, and data management. Data deduplication may be referred to as “dedupe”.
Data reduction for data storage initially relied on the fact that a larger piece of data can be represented by a smaller fingerprint. The fingerprint can be, for example, a hash. By way of illustration, a 1 KB block of data may be identified, with negligible probability of collision, by a 128-bit cryptographic hash. Sophisticated techniques for computing hashes have been developed. Data reduction for data storage also relied on the fact that much of the data being stored has already been stored. If data has already been stored, then it does not have to be stored again. Instead of storing a copy of a block of data that is already stored, a record that identifies and facilitates locating the previously stored block can be stored. The record can include the fingerprint and other information. Data reduction thus involves both breaking a larger piece of data into smaller pieces of data, which can be referred to as “chunking”, and producing the identifier, which can be performed by hashing.
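The chunking and fingerprinting described above can be sketched as follows. This is a minimal illustration, not any particular system's implementation: it uses fixed-size 1 KB chunks and BLAKE2b truncated to a 16-byte digest as one possible choice of 128-bit hash.

```python
import hashlib

CHUNK_SIZE = 1024  # fixed-size chunking; 1 KB chunks, as in the example above


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Break a larger piece of data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk_bytes: bytes) -> bytes:
    """Compute a 128-bit fingerprint (BLAKE2b with a 16-byte digest)."""
    return hashlib.blake2b(chunk_bytes, digest_size=16).digest()


data = b"x" * 4096
prints = [fingerprint(c) for c in chunk(data)]
# all four 1 KB chunks are identical, so all four 16-byte fingerprints match
```

Because the four chunks are byte-identical, their fingerprints collide by design; a dedupe system would store the chunk data once and record the fingerprint for the other three occurrences.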
Conventionally, determining whether a chunk of data has already been stored involved comparing chunks of data byte-by-byte. After dedupe chunking and hashing have been performed, determining whether a chunk of data has been stored can instead involve comparing fingerprints (e.g., hashes) rather than comparing chunks of data byte-by-byte. Comparing 128-bit hashes can be more efficient than comparing chunks (e.g., 1 KB, 128 KB) of data byte-by-byte. Therefore, data reduction for data storage can involve chunking larger pieces of data into smaller chunks, computing fingerprints (e.g., hashes) for the smaller chunks, and then comparing fingerprints. Comparing fingerprints can involve indexing the fingerprints to facilitate their retrieval and searching. However, indexing should not consume so much additional memory that an inordinate amount of the space saved through data reduction is spent on indexing.
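The fingerprint-comparison approach can be sketched with a hypothetical in-memory chunk store, where the index is a simple dictionary mapping fingerprints to storage locations (real systems use far more memory-conscious index structures, for the reason noted above):

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """128-bit fingerprint via BLAKE2b with a 16-byte digest."""
    return hashlib.blake2b(chunk, digest_size=16).digest()


class ChunkStore:
    """Hypothetical in-memory store: an index maps fingerprints to locations."""

    def __init__(self):
        self.index = {}   # fingerprint -> storage location (offset into chunks)
        self.chunks = []  # the actual stored chunk data

    def put(self, chunk: bytes) -> int:
        fp = fingerprint(chunk)
        loc = self.index.get(fp)
        if loc is not None:   # fingerprint match: chunk already stored
            return loc        # record only the location, not the data
        self.chunks.append(chunk)
        self.index[fp] = len(self.chunks) - 1
        return self.index[fp]


store = ChunkStore()
a = store.put(b"A" * 1024)
b = store.put(b"B" * 1024)
dup = store.put(b"A" * 1024)  # duplicate: no new chunk is stored
# a == dup, and only two chunks are physically stored
```

Looking up a 16-byte fingerprint in the index replaces a byte-by-byte comparison against every stored chunk, which is the efficiency gain the passage describes.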
Data reduction for data transmission initially relied on a similar fact: a large piece of data that can be represented by its smaller fingerprint can, in effect, be transmitted by transmitting only the fingerprint to a system that already has the large piece of data and an indexed fingerprint for it. Data reduction for data transmission also relied on the fact that much of the data being transmitted has already been transmitted. Once again, representing a large piece of data using a fingerprint, and determining whether a certain fingerprint has been seen before, both involve chunking and fingerprinting (a.k.a. chunking and hashing), and indexing.
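A sketch of this transmission-side idea, under the simplifying assumption that the sender can query the receiver's fingerprint index directly (a real protocol would exchange fingerprints over the wire):

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """128-bit fingerprint via BLAKE2b with a 16-byte digest."""
    return hashlib.blake2b(chunk, digest_size=16).digest()


class Receiver:
    """Hypothetical receiving system that indexes chunks it has seen."""

    def __init__(self):
        self.index = {}  # fingerprint -> chunk already received

    def has(self, fp: bytes) -> bool:
        return fp in self.index

    def accept(self, fp: bytes, chunk: bytes) -> None:
        self.index[fp] = chunk


def send(chunk: bytes, receiver: Receiver) -> int:
    """Return bytes 'on the wire': just the fingerprint if the receiver has the chunk."""
    fp = fingerprint(chunk)
    if receiver.has(fp):
        return len(fp)               # transmit only the 16-byte fingerprint
    receiver.accept(fp, chunk)
    return len(fp) + len(chunk)      # first time: fingerprint plus full chunk


rx = Receiver()
first = send(b"payload" * 200, rx)   # the full 1400-byte chunk goes over the wire
second = send(b"payload" * 200, rx)  # only the 16-byte fingerprint does
```

The second transmission of the same chunk costs 16 bytes instead of 1416, which is the bandwidth saving the passage describes.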
Data reduction can include dedupe. Dedupe can be applied to aggregations of data (e.g., files) that can be partitioned into smaller parts (e.g., chunks). An aggregation of data can be referred to more generally as an object. Conventional dedupe has included identifying boundaries between chunks of data and computing a hash for the data between the chunk boundaries. Comparing chunk hashes facilitates determining whether a chunk has been previously stored and/or transmitted. If the chunk has already been stored, then there is no need to store it again; there is only a need to record the fact that the chunk is stored and where it is stored. If the chunk has already been transmitted, and if it was stored at the receiving site, then there is no need to transmit the whole chunk again; there is only a need to record the fact that the chunk was stored at the receiving site and where it was stored. Determining whether a chunk has been previously stored and/or transmitted involves comparing fingerprints. Efficiently comparing fingerprints involves efficiently finding fingerprints, which involves prior attention to indexing.
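The boundary identification mentioned above is commonly content-defined: a boundary is placed wherever a hash of a small sliding window of the data matches a pattern, so boundaries move with the content rather than with fixed offsets. The following toy sketch illustrates the idea; production systems use efficient rolling hashes (e.g., Rabin fingerprints) rather than rehashing each window, and the window size and mask here are arbitrary illustrative choices.

```python
import hashlib

WINDOW = 8   # sliding window width, in bytes (illustrative choice)
MASK = 0x3F  # boundary when low 6 hash bits are zero: ~64-byte average chunks


def boundaries(data: bytes) -> list[int]:
    """Content-defined boundaries: cut where a windowed hash matches MASK."""
    cuts = []
    for i in range(WINDOW, len(data)):
        window = data[i - WINDOW:i]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")
        if h & MASK == 0:
            cuts.append(i)
    return cuts


def chunks(data: bytes) -> list[bytes]:
    """Partition an object into chunks at its content-defined boundaries."""
    out, prev = [], 0
    for cut in boundaries(data):
        out.append(data[prev:cut])
        prev = cut
    out.append(data[prev:])
    return out


data = b"The quick brown fox jumps over the lazy dog. " * 20
parts = chunks(data)
assert b"".join(parts) == data  # chunking is lossless
```

Because boundaries depend only on local content, identical runs of data tend to produce identical chunks even when their absolute offsets within an object differ, which is what makes the chunk hashes comparable across stores and transmissions.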