1. Technical Field
The present invention relates to data processing and, in particular, to size reduction of files or objects in a data processing system. Still more particularly, the present invention provides a method, apparatus, and program for data redundancy elimination at the block level.
2. Description of Related Art
Despite increasing capacities of storage systems and network links, there are often benefits to reducing the size of file objects that are stored and/or transmitted. Examples of environments that would see such benefits include mobile devices with limited storage, communication over telephone links, or storage of reference data, which is data that is written, saved permanently, and often never again accessed. Other examples include wide-area transfers of large objects, such as scientific data sets, and transfers over saturated links. The present invention is concerned with self-contained storage systems, in which all data is stored in a single location. Data can take the form of files in a file system, objects in a database, or other storage, and the terms “object,” “file,” and “file object” are used interchangeably in this document.
Numerous techniques exist for reducing the size of large objects, including data compression, duplicate suppression, and delta encoding. Data compression eliminates redundancy within a single object. Duplicate suppression eliminates redundancy caused by identical objects. Delta encoding eliminates redundancy of an object relative to another object, which may be an earlier version of the object having the same name.
Another technique involves a method for dividing larger objects into smaller, variable-sized “chunks” and eliminating duplicate chunks. The boundaries of the chunks may be determined, for example, using a function called a Rabin fingerprint over a sliding window of the content. The Rabin fingerprint is only one such solution, and other techniques may be used to efficiently and deterministically hash the content. Such content-defined blocks isolate changes within an object, so that changes in one part of an object do not affect other parts and duplication of blocks of content across objects can be detected. This technique was first proposed for the low-bandwidth file system (LBFS) and has since been applied to other systems.
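The boundary selection described above can be illustrated with a minimal sketch. It is not the LBFS implementation: a simple polynomial rolling hash stands in for a true Rabin fingerprint, and the names and constants (`chunk`, `WINDOW`, `MASK`, `BASE`, `MOD`, the size limits) are assumptions chosen only for illustration.

```python
WINDOW = 16            # bytes covered by the sliding window (assumed value)
MASK = (1 << 6) - 1    # cut when the low 6 hash bits are zero: about 1 in 64 positions
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 31) - 1    # modulus for the rolling hash (assumed value)


def chunk(data, min_size=WINDOW, max_size=1024):
    """Split data into variable-sized chunks at content-defined boundaries.

    A boundary is placed after byte i when the hash of the last WINDOW
    bytes matches a fixed bit pattern, so the decision depends only on
    nearby content and not on the absolute offset within the object.
    """
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the byte that slides out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])    # trailing partial chunk
    return chunks
```

Because each boundary depends only on the preceding `WINDOW` bytes, inserting a byte near the start of an object shifts the later boundaries along with the content, and most later chunks come out byte-for-byte identical to those of the unmodified object.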
FIG. 1 illustrates a process of dividing objects into blocks and suppressing duplicate blocks. File object 110 is divided into a plurality of content-defined blocks or “chunks” 120. The process then compares the chunks and removes exact matches, replacing each duplicate chunk with a reference to an identical chunk 132 in the plurality of chunks 130. By replacing duplicate chunks with references to identical chunks, the size of the file object is reduced. However, the LBFS technique requires chunks to be identical. If one byte is different between two chunks, no benefit is realized between those two chunks. Thus, an object that has many minor changes scattered throughout may see no improvements whatsoever from the LBFS technique.
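The replacement of duplicate chunks with references can be sketched as follows. This is an illustration under stated assumptions, not the LBFS code: chunks are identified by their SHA-256 digest, and the names `suppress_duplicates`, `restore`, `store`, and `recipe` are hypothetical.

```python
import hashlib


def suppress_duplicates(chunks, store=None):
    """Store each distinct chunk once and describe the object as a list of references."""
    store = {} if store is None else store
    recipe = []                          # ordered references that rebuild the object
    for c in chunks:
        key = hashlib.sha256(c).digest()
        store.setdefault(key, c)         # keep only the first copy of identical chunks
        recipe.append(key)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original object by following the references in order."""
    return b"".join(store[key] for key in recipe)


# The third chunk duplicates the first; the fourth differs from it by one byte.
chunks = [b"alpha", b"beta", b"alpha", b"alphb"]
store, recipe = suppress_duplicates(chunks)
```

In this example the store holds three chunks for a four-chunk object. The chunk `b"alphb"` is stored in full even though it differs from `b"alpha"` by a single byte, which is the limitation noted above.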
Yet another technique is known as delta encoding via resemblance detection (DERD). This technique attempts to extend delta encoding by identifying similar objects that may otherwise have no association, either spatial or temporal, with the object being encoded. The technique then performs delta encoding of the object against a chosen similar object. The resemblance detection step typically uses Rabin fingerprints to compute a set of values based on the contents of the object and then deterministically select a small number of these values to represent each object. Two objects with many of these fingerprints in common are likely to have much of their content in common overall.
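The resemblance detection step can be sketched as below. The sketch makes several assumptions: a truncated BLAKE2b hash stands in for a Rabin fingerprint, the deterministic selection rule is "keep the `k` numerically smallest values," and the names `features` and `resemblance` and the parameters `window` and `k` are hypothetical.

```python
import hashlib


def features(data, window=8, k=16):
    """Hash every sliding window and keep the k smallest hash values."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(data[i:i + window], digest_size=8).digest(), "big"
        )
        for i in range(len(data) - window + 1)
    }
    return frozenset(sorted(hashes)[:k])


def resemblance(a, b):
    """Fraction of selected features the two objects have in common."""
    fa, fb = features(a), features(b)
    if not fa or not fb:
        return 0.0                       # objects shorter than one window
    return len(fa & fb) / min(len(fa), len(fb))
```

Two objects that share most of their content produce mostly the same window hashes, so their smallest `k` values largely coincide. Unrelated objects are very unlikely to share any of them.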
FIG. 2 illustrates a process of delta encoding via resemblance detection. File objects 202 and 204 share common content. The process performs resemblance detection and delta encodes object 204 relative to object 202. The process results in a delta encoded file object 214. By replacing objects with delta encoded objects, a reduction in object size is realized. However, the DERD technique does not scale well with large datasets, because the resemblance detection step uses a quadratic algorithm. DERD also fails to detect resemblance when multiple objects each match a different part of the object being encoded. For example, if file A consists of the concatenation of files B-Z, then each file B-Z closely resembles a portion of file A; however, DERD is not capable of detecting the resemblance, because the fingerprints of each file B-Z would intersect only a small portion of the fingerprints of file A.
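The delta encoding step, in which one object is expressed relative to a similar object, can be sketched as follows. This is an assumption-laden illustration rather than the encoder used by DERD: it relies on Python's `difflib.SequenceMatcher` to find common runs, and the operation format and the names `delta_encode` and `delta_decode` are hypothetical.

```python
from difflib import SequenceMatcher


def delta_encode(base, target):
    """Express target as copies from base plus literal bytes absent from base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # bytes only target has
        # "delete" needs no operation: those base bytes are simply not copied
    return ops


def delta_decode(base, ops):
    """Rebuild the target from the base object and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

When the two objects share most of their content, the delta consists mainly of short copy instructions, and only the differing bytes are stored literally.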
Another technique is an optimization of the Rsync protocol. Rsync allows two versions of a file to be synchronized across a slow link by sending hashes of blocks of content and identifying when one copy has the same blocks as the other, possibly offset between the two copies. A multi-round version of Rsync has been devised, which tries large blocks and then decomposes them into smaller blocks to find pieces that are similar enough to delta encode.
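The basic single-round block matching that Rsync performs can be sketched as follows; the multi-round variant is not shown. The sketch assumes a simple two-part rolling checksum modeled loosely on Rsync's weak checksum, SHA-256 as the strong hash, and a small block size. The names `weak_sum`, `signatures`, `match_blocks`, `rebuild`, `BLOCK`, and `M` are hypothetical.

```python
import hashlib

BLOCK = 8        # block size in bytes (assumed; real deployments use larger blocks)
M = 1 << 16      # modulus for the weak checksum


def weak_sum(blk):
    """Cheap two-part checksum that can be updated as the window slides."""
    a = sum(blk) % M
    b = sum((len(blk) - k) * x for k, x in enumerate(blk)) % M
    return a, b


def signatures(old, block=BLOCK):
    """Map weak checksum -> [(strong hash, offset)] for each full block of old."""
    table = {}
    for off in range(0, len(old) - block + 1, block):
        blk = old[off:off + block]
        entry = (hashlib.sha256(blk).digest(), off)
        table.setdefault(weak_sum(blk), []).append(entry)
    return table


def match_blocks(new, table, block=BLOCK):
    """Scan new at every byte offset, emitting block copies and literal bytes."""
    ops, lit, i, n = [], bytearray(), 0, len(new)
    a = b = None
    while i + block <= n:
        if a is None:
            a, b = weak_sum(new[i:i + block])
        strong = None
        hit = None
        for digest, off in table.get((a, b), ()):
            if strong is None:
                strong = hashlib.sha256(new[i:i + block]).digest()
            if digest == strong:         # confirm the weak match with the strong hash
                hit = off
                break
        if hit is not None:
            if lit:
                ops.append(("literal", bytes(lit)))
                lit.clear()
            ops.append(("copy", hit))
            i += block
            a = None                     # recompute the checksum at the new position
        else:
            lit.append(new[i])
            if i + block < n:            # roll the window forward by one byte
                a_next = (a - new[i] + new[i + block]) % M
                b = (b - block * new[i] + a_next) % M
                a = a_next
            i += 1
    lit += new[i:]
    if lit:
        ops.append(("literal", bytes(lit)))
    return ops


def rebuild(old, ops, block=BLOCK):
    """Reconstruct the new copy from the old copy and the operations."""
    out = bytearray()
    for kind, val in ops:
        out += old[val:val + block] if kind == "copy" else val
    return bytes(out)
```

Because the checksum is rolled one byte at a time, blocks of the old copy are found in the new copy even when an insertion has shifted them to offsets that are not multiples of the block size.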