Data duplication is a common problem. As an example, numerous computer users have the same applications installed on their computers. In addition, when emails and attachments are forwarded, different users end up storing copies of the same emails and attachments. As computing and storage become more centralized, servers increasingly store the same data for many different users or organizations.
Furthermore, many different applications such as data archival require the servers to maintain multiple copies of largely identical data. If the duplicate information can be identified and eliminated, the cost of storing such duplicated information could be saved. In addition, if identical data in e-mails, attachments, or other similar objects that are transmitted over the Internet can be identified, less data can be transmitted, thus reducing the bandwidth required to send information over the Internet. For Internet businesses investing in hardware and infrastructure to transmit large amounts of data, the savings in eliminating duplicate information could be significant.
Certain conventional approaches to identifying duplicate data divide data into fixed-size units of data, or chunks, and check whether any such chunks are identical. Identical data, however, may not be stored at the same offset within the fixed-size chunks.
FIG. 1 illustrates a conventional approach to dividing or chunking data commonly referred to as the “blocking approach”. With this approach, data stream 105 is divided into consecutive equal size chunks that are illustrated, for example, by chunks 110, 115, 120, 125, and 130. Each chunk represents a specific number of bytes of data.
However, this approach cannot handle the case where data is inserted into, or removed from, the middle of a data stream, as shown when data 145 is inserted in chunk 115 to obtain the new data stream 135. When data 145 is inserted in chunk 115, the data of chunks 115, 120, 125, and 130 is displaced such that the chunks 150, 155, 160, 165, 170 of data stream 135 are somewhat similar, but not identical, to the chunks of data stream 105. Chunk 170 is an additional chunk that contains the data shifted from the end of chunk 130 by the introduction of data 145.
The data chunks are also affected if data is deleted from the middle of a data stream, in which case all the data after the modified chunk is displaced. The data after the modified chunk is identical to the corresponding data in the original stream, but its offset within each chunk is different, so duplicates in the data cannot be identified. Consequently, although most of the data may be identical, very few chunks within the data are identical.
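The effect of an insertion on fixed-size chunks can be illustrated with a minimal sketch. The function name `fixed_size_chunks`, the 4-byte chunk size, and the sample data are assumptions chosen for illustration only; they do not correspond to the reference numerals of FIG. 1.

```python
def fixed_size_chunks(data: bytes, size: int) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

# Hypothetical sample stream: five 4-byte chunks.
original = b"AAAABBBBCCCCDDDDEEEE"
# Insert two bytes in the middle of the second chunk.
modified = original[:6] + b"xx" + original[6:]

before = fixed_size_chunks(original, 4)
after = fixed_size_chunks(modified, 4)

# Chunks whose content appears in both streams.
shared = set(before) & set(after)
print(before)
print(after)
print(shared)
```

In this sketch only the chunk preceding the insertion point is still found in both streams; every later chunk holds the same bytes at a shifted offset and therefore no longer matches.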
Another conventional approach to dividing data, namely data-based chunking or content-based chunking, identifies specific patterns or markers in the data, and identifies chunk boundaries based on those patterns. The marker selected for chunking may be any pattern as long as the same pattern is used for all the chunks. The marker may be a sequence of bytes such that some mathematical function of the data results in a certain bit pattern or it may be as simple as a full stop or a period. For example, each period in the data defines a chunk boundary. If periods are used as markers, the data is chunked into sentences.
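As a rough sketch of a marker defined by a mathematical function of the data, a chunk boundary can be declared wherever a checksum of the most recent bytes has its low-order bits equal to zero. The choice of CRC-32 as the function, the 8-byte window, the 6-bit mask, and the name `hash_marker_chunks` are all assumptions made for illustration; the conventional approaches described here do not prescribe them.

```python
import random
import zlib

def hash_marker_chunks(data: bytes, window: int = 8, mask: int = 0x3F) -> list[bytes]:
    """Cut after any position where the CRC-32 of the preceding `window`
    bytes has all bits selected by `mask` equal to zero (assumed marker rule)."""
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing data after the last marker
    return chunks

# Hypothetical test data: pseudo-random bytes with a fixed seed.
rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
# Insert three bytes near the start of the stream.
modified = original[:10] + b"xyz" + original[10:]

before = hash_marker_chunks(original)
after = hash_marker_chunks(modified)
shared = set(before) & set(after)
print(len(before), len(after), len(shared))
```

Because each boundary decision depends only on the bytes in the local window, boundaries located well past the insertion point fall on the same content in both streams, which is the property that distinguishes this approach from fixed-size chunking.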
FIG. 2 illustrates the data-based chunking approach. Markers within data stream 205 are illustrated by markers 210, 215, 220, 225, 230. Markers 210, 215, 220, 225, 230 are used to divide data stream 205 into unequal-sized chunks 235, 240, 245, 250, 255, 260. When data 270 is inserted in data stream 205 to obtain data stream 265, the data after the insertion point is displaced as before.
However, because the chunking is based on markers, the chunk boundaries after the insertion point are displaced along with the data, by the size of data 270, so the content of the subsequent chunks is unchanged. While chunk 275 has changed from chunk 240, the correspondence between chunks 245 and 285, chunks 250 and 290, chunks 255 and 295, and chunks 260 and 297 can still be identified.
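The behavior described for FIG. 2 can be sketched using a period as the marker. The function name `marker_chunks` and the sample sentences are hypothetical and serve only to show that chunks following an insertion keep their content.

```python
def marker_chunks(data: bytes, marker: bytes = b".") -> list[bytes]:
    """Split data so that each chunk ends with `marker`; any data after the
    last marker forms a final chunk."""
    chunks = []
    start = 0
    while True:
        idx = data.find(marker, start)
        if idx == -1:
            break
        end = idx + len(marker)
        chunks.append(data[start:end])
        start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Hypothetical sample stream of four sentences.
original = b"One. Two is here. Three. Four."
# Insert a word into the second sentence.
modified = original.replace(b"Two is", b"Two really is")

before = marker_chunks(original)
after = marker_chunks(modified)
shared = set(before) & set(after)
print(before)
print(after)
print(len(shared))
```

Only the chunk that contains the inserted data differs between the two streams; the chunks on either side of it are still recognized as duplicates.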
Consequently, data-based chunking can identify many more duplicate chunks than the blocking approach. However, it creates chunks with a wide variation in sizes, thus increasing processing and storage overhead and limiting the potential savings in storage. It also becomes difficult to locate data when using data-based chunking. In addition, the selected marker may not appear in the data being chunked. As an example, the marker may be a period, while the document being chunked uses semicolons instead of periods. Consequently, data-based chunking may also miss a significant number of duplicates.
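Both drawbacks can be seen in a small sketch, assuming a period marker and hypothetical sample text; the helper name `split_on_marker` is likewise an assumption.

```python
def split_on_marker(text: str, marker: str) -> list[str]:
    """Split text into chunks that each end with `marker`; text after the
    last marker, if any, forms a final chunk."""
    parts = text.split(marker)
    chunks = [part + marker for part in parts[:-1]]
    if parts[-1]:
        chunks.append(parts[-1])
    return chunks

# Hypothetical document with sentences of very different lengths.
with_periods = "Short. A considerably longer sentence follows here. End."
# The same document written with semicolons instead of periods.
with_semicolons = with_periods.replace(".", ";")

# Chunk sizes vary widely when sentence lengths vary.
sizes = [len(chunk) for chunk in split_on_marker(with_periods, ".")]
# The marker never appears, so the whole document becomes a single chunk.
no_marker = split_on_marker(with_semicolons, ".")
print(sizes)
print(len(no_marker))
```

When the marker is absent, the entire document is treated as one chunk, so a change anywhere in it prevents any part of the document from being identified as a duplicate.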
What is therefore needed is a system, a service, a computer program product, and an associated method for dividing data into chunks that are predominantly of a predetermined size such that a large percentage of duplicate data may be identified and managed. Consequently, disk space for storing data may be reduced and bandwidth for transmitting data may be reduced. The reliability of data storage and network transmission may also be increased because if an error occurs, an identified duplicate can be used. The need for such a solution has heretofore remained unsatisfied.