The present invention relates to data storage and communications and, more particularly, to data compression. A major objective of the present invention is to provide improved performance in adaptive data compression systems.
Data compression is the reversible re-encoding of information into a more compact expression. This more compact expression permits information to be stored and/or communicated more efficiently, generally saving both time and expense. Typical encoding schemes, e.g., those based on ASCII, encode alphanumeric characters and other symbols into binary sequences. A major class of compression schemes encodes symbol combinations using binary sequences not otherwise used to encode individual symbols. Compression is effected to the degree that the symbol combinations represented in the encoding scheme are encountered in a given text or other file. By analogy with bilingual dictionaries used to translate between human languages, the device that embodies the mapping of uncompressed code into compressed code is commonly referred to as a "dictionary".
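The dictionary mapping described above can be sketched as follows. This is a minimal illustration only, not a scheme taken from the cited art: the helper names `build_codebook`, `encode`, and `decode`, the choice of codes 256 and above for combinations, and the greedy longest-match rule are all assumptions made for the example, and the input is assumed to contain only symbols with codes below 256.

```python
def build_codebook(combinations):
    """Assign codes 256, 257, ... to symbol combinations (hypothetical helper).

    Codes 0-255 stay reserved for individual symbols, so combination codes
    never collide with single-symbol codes.
    """
    return {combo: 256 + i for i, combo in enumerate(combinations)}


def encode(text, codebook):
    """Greedy longest-match encoding against a fixed dictionary."""
    longest = max((len(combo) for combo in codebook), default=1)
    codes, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to two symbols.
        for size in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in codebook:
                codes.append(codebook[chunk])
                i += size
                break
        else:
            # No combination matched: emit the ordinary single-symbol code.
            codes.append(ord(text[i]))
            i += 1
    return codes


def decode(codes, codebook):
    """Invert encode() using the same fixed dictionary."""
    reverse = {code: combo for combo, code in codebook.items()}
    return "".join(reverse[c] if c in reverse else chr(c) for c in codes)


codebook = build_codebook(["the ", "ing"])
codes = encode("the thing", codebook)
print(codes)                    # [256, 116, 104, 257]
print(decode(codes, codebook))  # the thing
```

Nine symbols are expressed as four codes here only because both dictionary entries happen to occur in the input; a text containing neither combination would gain nothing.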
The present invention is primarily applicable to dictionary-based compression schemes, which are part of a larger class of sequential compression schemes. These are contrasted with non-sequential schemes which examine an entire file before determining the encoding to be used. Other sequential compression schemes, such as run-length limited (RLL) compression, can be used in conjunction with the present invention.
Generally, the usefulness of a dictionary-based compression scheme is dependent on the frequency with which the symbol combination entries in the dictionary are matched as a given file is being compressed. A dictionary optimized for one file type is unlikely to be optimized for another. For example, a dictionary which includes a large number of symbol combinations likely to be found in newspaper text files is unlikely to effectively compress database files, spreadsheet files, bit-mapped graphics files, computer-aided design files, Musical Instrument Digital Interface (MIDI) files, etc.
Thus, a strategy using a single fixed dictionary might be best tied to a single application program. A more sophisticated strategy can incorporate means for identifying file types and selecting among a predetermined set of dictionaries accordingly. Even the more sophisticated fixed dictionary schemes are limited by the requirement that a file to be compressed must be matched to one of a limited number of dictionaries. Furthermore, there is no widely accepted standard for identifying file types, essentially limiting multiple dictionary schemes to specific applications or manufacturers.
Adaptive compression schemes are known in which the dictionary used to compress a given file is developed as that file is being compressed. Entries are made in the dictionary as symbol combinations are encountered in the file. The entries are used on subsequent occurrences of an encoded combination. Compression is effected to the extent that the symbol combinations occurring most frequently in the file are encountered as the dictionary is developing. Systems incorporating adaptive compression schemes can include means for clearing the dictionary between files so that the dictionary can be adapted on a file-by-file basis.
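The way an adaptive dictionary develops during compression can be illustrated with a short sketch in the general style of the Lempel-Ziv-Welch technique. It is offered only as one familiar example of an adaptive scheme, under stated assumptions: the function names, the 4096-entry limit, and the use of plain integer codes (rather than packed bit fields) are choices made for the illustration.

```python
def lzw_compress(data: bytes, max_entries: int = 4096):
    """Build the dictionary while compressing; returns a list of integer codes."""
    # Start with one entry per possible byte value.
    dictionary = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate  # keep extending the match
        else:
            codes.append(dictionary[current])
            if len(dictionary) < max_entries:
                # Record the newly encountered combination for later reuse.
                dictionary[candidate] = len(dictionary)
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes, max_entries: int = 4096):
    """Rebuild the same dictionary from the code stream alone."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The encoder used an entry it had only just created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid code in stream")
        output.append(entry)
        if len(dictionary) < max_entries:
            dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(output)


sample = b"ABABABA"
codes = lzw_compress(sample)
print(codes)                            # [65, 66, 256, 258]
print(lzw_decompress(codes) == sample)  # True
```

Because each call starts from a fresh single-byte dictionary, invoking the function once per file corresponds to clearing the dictionary between files.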
Adaptive compression systems and methods are disclosed in U.S. Pat. No. 4,464,650 to Eastman et al. and U.S. Pat. No. 4,558,302 to Welch. These references explain further the use of dictionaries in both adaptive and non-adaptive compression strategies. Further pertinent references to compression strategies include: G. Held, "Data Compression: Techniques and Applications - Hardware and Software Considerations", Wiley, 1983; R. G. Gallager, "Variations on a Theme by Huffman", IEEE Transactions on Information Theory, Vol. IT-24, No. 6, pp. 668-674, November 1978; J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. IT-23, No. 3, pp. 337-343, May 1977; J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. IT-24, No. 5, pp. 530-536, September 1978; and T. A. Welch, "A Technique for High-Performance Data Compression", IEEE Computer, June 1984.
In an adaptive compression scheme, the degree of compression depends on the extent to which the portion of the file used to develop the dictionary resembles the remainder of the file. "Resembles" is used here to refer to a similarity in symbol-combination frequency distributions. However, especially with certain long files, the frequency distribution of symbol combinations can shift dramatically over the length of a file. For example, a financial report beginning with a verbal description of a company and its performance and concluding with primarily tabular numeric data would not be compressed optimally if the dictionary were completed before the numeric tables were encountered.
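The effect of such a shift can be demonstrated with a small experiment. The sketch below is illustrative only: the helper `compress_with`, the sample prose and table strings, and the deliberately small 300-entry limit are assumptions chosen so that the dictionary fills during the prose portion; the dictionary update rule is the same Lempel-Ziv-Welch style rule used for illustration elsewhere, not a method taken from this disclosure.

```python
def compress_with(dictionary, data, max_entries):
    """LZW-style pass that keeps updating `dictionary` until it is full."""
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes


def fresh_dictionary():
    """One entry per byte value, as at the start of a file."""
    return {bytes([i]): i for i in range(256)}


# Hypothetical two-part "financial report": prose first, numeric table second.
prose = b"the company reported strong growth in the quarter. " * 40
table = b"1024.50 2048.75 4096.25 8192.00\n" * 40
limit = 300  # small on purpose, so the dictionary is complete before the table

shared = fresh_dictionary()
prose_codes = compress_with(shared, prose, limit)
# The dictionary is now full of prose combinations and can no longer adapt.
table_codes_frozen = compress_with(shared, table, limit)

# For comparison: the same table compressed with a dictionary built from it.
table_codes_fresh = compress_with(fresh_dictionary(), table, limit)

print(len(table), len(table_codes_frozen), len(table_codes_fresh))
```

With these inputs, no prose entry contains a digit, so the frozen dictionary emits one code per table byte; since a 300-entry dictionary needs 9-bit codes, the table is actually expanded rather than compressed. The dictionary built from the table itself emits fewer codes than there are bytes.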
Accordingly, a compression scheme is desired which provides the advantages of adaptive compression schemes but yields improved performance for long files with changing frequency distributions of symbol combinations. Such a scheme should be adapted to communications and storage systems without requiring special file type codes so that the scheme can be applied effectively to a great variety of file types.