In the field of data processing systems, it is desirable to reduce the size of data files to conserve memory and to efficiently use available transmission bandwidth. This objective can be achieved by the use of a data compression system. Data compression generally refers to any technique for converting data in a given format into an alternative format having fewer characters or symbols than the original format. Data compression systems can encode a stream of data signals into compressed data signals and decode the compressed data signals to obtain the original stream of data signals.
A. Lempel and J. Ziv have described a method of compressing data (hereinafter the "L-Z compression technique") based on a dictionary of character sequences that have already been encountered in an input data stream. When a sequence of characters being compressed has already been encountered and stored in the dictionary, a compressor outputs a reference value or token to a coded file. The reference value identifies the string in the dictionary that is identical to the sequence of characters being compressed. In general, the number of bits required to identify the dictionary entry representing this sequence of characters is smaller than the number of bits that would have been required if the entire character string had been output to the encoded file. In this manner, data is compressed by replacing sequences of characters with a reference value that identifies an entry in a dictionary. References for the L-Z compression technique include J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, IT-23, 3, pages 337-343 (May 1977) and J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, IT-24, 5, pages 530-536 (September 1978).
U.S. Pat. No. 4,464,650 to Eastman et al. describes an adaptive data compression system that parses the stream of input data symbols into segments, each segment comprising a prefix and the next data signal occurring in the input data stream following the prefix. The prefix comprises the longest match with a previous segment. A pointer signal is generated for each segment, the pointer signal pointing to the previous segment matching the prefix. The pointer signals generated for the respective segments of the input data stream form the compressed stream of digital code signals.
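The segment parsing described above can be illustrated with a minimal sketch. It is not the implementation of the cited patent. The function names `parse_segments` and `rebuild`, the use of index 0 for an empty prefix, and the representation of each pointer signal as an integer index are assumptions made only for illustration.

```python
def parse_segments(data):
    """Parse data into (pointer, next_symbol) pairs.

    Each segment is the longest previously seen segment (the prefix)
    followed by the next input symbol. Index 0 stands for the empty
    prefix (an assumption of this sketch).
    """
    segments = {"": 0}  # previously seen segment -> its index
    output = []
    prefix = ""
    for symbol in data:
        candidate = prefix + symbol
        if candidate in segments:
            # Keep extending the match while it is a known segment.
            prefix = candidate
        else:
            # Emit a pointer to the matched prefix plus the new symbol,
            # then remember the new segment for later matches.
            output.append((segments[prefix], symbol))
            segments[candidate] = len(segments)
            prefix = ""
    if prefix:
        # Input ended inside a match: emit the pointer with no symbol.
        output.append((segments[prefix], ""))
    return output


def rebuild(pairs):
    """Invert parse_segments by replaying the segment table."""
    segments = [""]
    pieces = []
    for pointer, symbol in pairs:
        segment = segments[pointer] + symbol
        segments.append(segment)
        pieces.append(segment)
    return "".join(pieces)
```

For the input "abababa", this sketch yields the pairs (0, "a"), (0, "b"), (1, "b"), (3, "a"), corresponding to the segments "a", "b", "ab", and "aba".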
U.S. Pat. No. 4,558,302 to Welch describes a data compression technique that (1) stores strings of characters parsed from the input data stream and (2) searches the input data stream by comparing the stream to the stored strings to determine the longest match of a stored string. Each stored string comprises a prefix string and an extension character. The extension character is the last character in the stored string and the prefix string comprises all but the extension character. Each stored string has a corresponding code signal. When the longest match between the input data character stream and the stored strings is determined, the code signal for the longest match is transmitted as the compressed code signal for the encountered string of characters, and an extended string is stored in the string table. The prefix of the extended string is the longest match, and the extension character of the extended string is the next input data character following the longest match.
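The string-table technique described above can be sketched as follows. This is an illustrative sketch rather than the patented implementation. The names `lzw_encode` and `lzw_decode`, the initialization of the table with all 256 single-byte strings, and the unbounded table size are assumptions of the sketch.

```python
def lzw_encode(data):
    """Encode bytes as a list of integer codes using a growing string table."""
    # Assumption: the table starts with every single-byte string.
    table = {bytes([i]): i for i in range(256)}
    codes = []
    match = b""
    for byte in data:
        extended = match + bytes([byte])
        if extended in table:
            # The stored-string match can still be lengthened.
            match = extended
        else:
            # Emit the code for the longest match, then store the
            # extended string (longest match + extension character).
            codes.append(table[match])
            table[extended] = len(table)
            match = bytes([byte])
    if match:
        codes.append(table[match])
    return codes


def lzw_decode(codes):
    """Rebuild the original bytes by reconstructing the same string table."""
    table = {i: bytes([i]) for i in range(256)}
    if not codes:
        return b""
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the string the encoder had just stored.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid code in compressed stream")
        pieces.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(pieces)
```

For the input b"ABABABA", the encoder in this sketch emits the codes 65, 66, 256, 258, where 256 stands for "AB" and 258 for "ABA".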
A characteristic of the above-described data compression techniques is that they operate only on the input data stream of the data file to be compressed. These data compression techniques use a previously encountered portion of the file to be compressed to achieve data compression. However, for certain applications, the data file to be compressed, i.e., the source file, is a revised version of an original data file. In view of the revised nature of the source file, it will be appreciated that the source file typically contains segments of characters that are identical to segments in the original file. A high level of compression can be achieved by taking advantage of this natural redundancy between the revised data file and the original data file.
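The benefit of exploiting redundancy between an original file and its revision can be illustrated with the preset-dictionary feature of Python's standard `zlib` module. This is an analogous, off-the-shelf technique shown only for illustration; it is not the system described in this document. The helper names `compress_against` and `decompress_against` are hypothetical, and DEFLATE only references roughly the last 32 KB of the supplied dictionary.

```python
import zlib


def compress_against(original, revised):
    """Compress `revised`, letting the compressor reference `original`."""
    compressor = zlib.compressobj(level=9, zdict=original)
    return compressor.compress(revised) + compressor.flush()


def decompress_against(original, packed):
    """Recover the revised bytes; the same `original` must be supplied."""
    decompressor = zlib.decompressobj(zdict=original)
    return decompressor.decompress(packed) + decompressor.flush()
```

When the revised file shares long runs with the original, the output of `compress_against` is typically much smaller than that of `zlib.compress` applied to the revised file alone, because shared runs are encoded as references into the original.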
In view of the foregoing, there is a need for a data compression system that uses the original data file as a dictionary to support the encoding of a revised version of this original data file. There is also a need for a data compression system that achieves a high level of compression based on the natural redundancy that occurs between the characters of an original data file and the revised version of this original data file.