Some data compression techniques, used in both data storage and network compression technologies, are based on so-called stateless algorithms such as those used by gzip and zlib. Gzip and zlib are software implementations of variants of the Lempel-Ziv algorithm. While these algorithms have some desirable characteristics (such as the fact that the decompressor does not require knowledge of a dictionary), they provide poor compression for several common data types because they ignore long-term temporal correlations in the data. For example, a Microsoft Word® file may be edited incrementally, with a new copy then saved or sent over a network. The new version closely resembles the old version. A stateless compression algorithm, however, compresses each copy independently and therefore does not take advantage of the redundancy between the two copies.
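The effect can be illustrated with a minimal sketch using Python's standard zlib module. The data, sizes, and variable names here are hypothetical and chosen only for illustration. The sketch compresses a slightly edited copy of a document statelessly and then, for contrast, compresses it again with the compressor primed with the old copy as a preset dictionary. The second form is no longer stateless, since the decompressor must hold the same old copy.

```python
import random
import zlib

# Hypothetical stand-in for a document: 8000 pseudo-random bytes, which a
# Lempel-Ziv compressor cannot shrink on its own.
rng = random.Random(0)
old_version = bytes(rng.getrandbits(8) for _ in range(8000))

# An incremental edit: a short insertion in the middle of the document.
new_version = old_version[:4000] + b"a small edit" + old_version[4000:]

# Stateless: the new copy is compressed with no memory of the old copy.
stateless_size = len(zlib.compress(new_version))

# For contrast: prime the compressor with the old copy as a preset
# dictionary so that matches against it can be found. This works here
# only because the old copy fits inside the 32 KB DEFLATE window.
compressor = zlib.compressobj(zdict=old_version)
primed = compressor.compress(new_version) + compressor.flush()

# The decompressor must be given the same dictionary to restore the data.
decompressor = zlib.decompressobj(zdict=old_version)
restored = decompressor.decompress(primed) + decompressor.flush()

print("stateless bytes:", stateless_size)
print("primed bytes:   ", len(primed))
```

In this sketch the stateless output is roughly the size of the input, while the primed output is far smaller because nearly all of the new copy is encoded as references to the old one.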
Certain replication-oriented products do provide more efficient means of transmission by using so-called differencing techniques. A disadvantage of differencing techniques is that they typically work only on files and not on streams of data. Streams have no clear beginning or end and typically need to be processed online, as bytes are seen, rather than offline, when a whole file is available. Another disadvantage of differencing techniques is that they need to know a priori the “basis” against which to difference (e.g., if a file is renamed, it cannot be differenced properly). Some improvements for the basis problem are available, such as a database of files used to locate suitable candidates, or an extended file created at a known location to serve as an additional basis for storing recently seen blocks. However, neither technique works on streams. Additionally, neither helps with optimizing local storage of data.
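The dependence on a known basis can be seen in a minimal sketch of a block-based differencer. This is an illustrative assumption, not the method of any particular product: the block size, the function names make_delta and apply_delta, and the copy/literal operation format are all hypothetical. Both functions take whole byte strings, which reflects the offline, file-oriented nature of the approach.

```python
import hashlib

BLOCK = 64  # hypothetical fixed block size


def make_delta(basis: bytes, target: bytes) -> list:
    """Encode target as copy/literal operations against aligned basis blocks."""
    # Index every aligned block of the basis by its hash.
    index = {}
    for off in range(0, len(basis), BLOCK):
        digest = hashlib.sha256(basis[off:off + BLOCK]).digest()
        index.setdefault(digest, off)

    ops = []
    for off in range(0, len(target), BLOCK):
        block = target[off:off + BLOCK]
        hit = index.get(hashlib.sha256(block).digest())
        if hit is not None:
            ops.append(("copy", hit, len(block)))  # block exists in the basis
        else:
            ops.append(("literal", block))         # block must be sent in full
    return ops


def apply_delta(basis: bytes, ops: list) -> bytes:
    """Rebuild the target from the basis and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, length = op
            out += basis[off:off + length]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical data: a 1024-byte file and a copy with a 5-byte overwrite.
basis = bytes(range(256)) * 4
edited = bytearray(basis)
edited[100:105] = b"EDIT!"
target = bytes(edited)

good_ops = make_delta(basis, target)   # correct basis is known
blind_ops = make_delta(b"", target)    # basis unknown, e.g. after a rename

literals_with_basis = sum(1 for op in good_ops if op[0] == "literal")
literals_without_basis = sum(1 for op in blind_ops if op[0] == "literal")
```

With the correct basis, only the single edited block is carried as a literal. Without it, every block is a literal and the differencer saves nothing.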
In view of the foregoing, a need exists in the art for techniques that address the aforementioned deficiencies.