In recent years, the growing number of stored electronic documents that have identical or virtually identical content has become a problem. To deal with this problem, data de-duplication techniques have been developed for reducing the data storage requirements of virtually identical files. These data de-duplication techniques identify file segments that are identical among virtually identical files, so that the data content of each shared file segment need be stored only once for the virtually identical files. The shared data content is placed in a common storage area, and each identical segment is removed from each of the virtually identical files and replaced with a corresponding link to the shared data content.
For example, a data de-duplication application identifies redundant data in pooled storage capacity and replaces it with one or more pointers pointing to a single instance of the data. The de-duplication application can operate on fixed or variable-size blocks of data and can de-duplicate data either post-process or on-line. See Yueh U.S. Pat. App. Pub. 2009/0063795 A1 published Mar. 5, 2009, incorporated herein by reference.
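The replacement of redundant blocks with pointers to a single stored instance can be illustrated with a minimal sketch. It assumes fixed-size blocks and assumes that identical blocks are recognized by comparing SHA-256 digests; neither choice is taken from the cited publication, and the names `deduplicate`, `restore`, `store`, and `recipes` are hypothetical.

```python
import hashlib


def deduplicate(files, block_size=4096):
    """Split each file into fixed-size blocks and store each distinct block once.

    `files` maps a file name to its bytes. Returns (store, recipes), where
    `store` maps a block digest to the block's bytes (the common storage area)
    and `recipes` maps a file name to its ordered list of digests (the links).
    Assumption: blocks with the same SHA-256 digest are treated as identical.
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        pointers = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Keep only the first copy of each distinct block.
            store.setdefault(digest, block)
            pointers.append(digest)
        recipes[name] = pointers
    return store, recipes


def restore(name, store, recipes):
    """Rebuild a file by following its links into the shared store."""
    return b"".join(store[digest] for digest in recipes[name])


if __name__ == "__main__":
    files = {
        "report_v1": b"A" * 8 + b"B" * 8,
        "report_v2": b"A" * 8 + b"C" * 8,
    }
    store, recipes = deduplicate(files, block_size=8)
    print(len(store), "distinct blocks stored for 4 blocks referenced")
    print(restore("report_v2", store, recipes) == files["report_v2"])
```

In this sketch the two files share their first block, so that block's content is held once in `store` while both entries in `recipes` point to it. A variable-size variant would differ only in how block boundaries are chosen.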
In recent years there has also been increasing use of data compression techniques to store data more efficiently. Data compression techniques have long been known for reducing redundancy in data for more efficient archival storage and for more efficient transmission over a limited-bandwidth channel. More recently, data compression techniques have been applied generally to the on-line storage of infrequently accessed files. A wide variety of data compression techniques are available depending on the type of data to be stored. There are also a number of well-known lossless data compression techniques that are applicable to all kinds of data.
A popular lossless data compression technique is the Lempel-Ziv procedure, which achieves data compression by replacing repeated occurrences of data with references to a dictionary that is built based on the input data. The basic procedure was published in Jacob Ziv and Abraham Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, 23(3), pp. 337-343, May 1977. Variations of the Lempel-Ziv procedure are further described in Eastman et al. U.S. Pat. No. 4,464,650 issued Aug. 7, 1984, incorporated herein by reference; Lempel et al. U.S. Pat. No. 5,373,290 issued Dec. 13, 1994, incorporated herein by reference; and Natanzon U.S. Pat. No. 7,719,443 issued May 18, 2010, incorporated herein by reference.
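The idea of replacing repeated data with references to earlier input can be sketched in simplified form as follows. This is an illustrative sliding-window encoder in the spirit of the 1977 procedure, in which the "dictionary" is the most recently seen input; it is not the method of any of the cited patents. The function names, the token layout `(offset, length, next_byte)`, and the `window` and `max_len` defaults are assumptions chosen for readability, and no attempt is made to pack the tokens into bits.

```python
def lz77_compress(data, window=255, max_len=15):
    """Encode bytes as a list of (offset, length, next_byte) tokens.

    offset/length refer back into the previous `window` bytes of input;
    next_byte is the literal that follows the match, or None at end of input.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlap), which is allowed.
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        nxt = data[i + best_len] if i + best_len < len(data) else None
        tokens.append((best_off, best_len, nxt))
        i += best_len + 1
    return tokens


def lz77_decompress(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for offset, length, nxt in tokens:
        start = len(out) - offset
        # Copy byte by byte so that overlapping references work.
        for k in range(length):
            out.append(out[start + k])
        if nxt is not None:
            out.append(nxt)
    return bytes(out)


if __name__ == "__main__":
    sample = b"abcabcabcabcx"
    tokens = lz77_compress(sample)
    print(tokens)
    print(lz77_decompress(tokens) == sample)
```

The brute-force search over the window keeps the sketch short; practical implementations typically index the window (for example with hash chains) to find matches faster.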