Digital data compression is an important tool because it can be used, for example, to reduce the storage requirements for files, to increase the rate at which data can be transferred over bandwidth-limited communication channels, and to reduce the internal redundancy of data prior to its encryption in order to provide increased security.
There are special purpose and general purpose data compression systems. Special purpose systems often are satisfactory when used to compress the source data for which they have been optimized. General purpose systems, however, customarily are designed to adapt to the source data, so they usually are better suited for compressing unknown or diverse source data types. Ideally, a general purpose system adapts promptly to fundamental changes in the compositional structure of the source data, as is required to provide significant compression for small files and for sources with internally inconsistent statistical characteristics, while also providing near optimal compression for large files with stable statistical characteristics. Designers have taken various approaches to reconciling these competing design goals, but the results of their efforts have not been fully satisfactory.
Shannon communication theory (C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, Vol. XXVII, No. 3, 1948, pp. 379-423 and No. 4, 1948, pp. 623-656) indicates that the ideal encoding of a given source symbol uses space equal to $-\log_2 P$, where $P$ is the probability of the occurrence of the symbol. When the encoding conforms to this memoryless model, the average space needed to represent any symbol is the entropy of the source:

$$H_0 = -\sum_{i=1}^{n} P(x = c_i) \log_2 P(x = c_i) \qquad (1)$$

where: $x$ is a randomly chosen symbol from a source containing $n$ unique symbols; and
$c_i$ ranges over all possible source symbols.
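As an illustration of the zero-order entropy just defined, the following minimal Python sketch estimates it from observed symbol frequencies. The function names `zero_order_entropy` and `ideal_code_length` are hypothetical, and estimating each probability by relative frequency is an assumption made for the example, not something prescribed by the references.

```python
import math
from collections import Counter


def ideal_code_length(probability):
    """Ideal space, in bits, for a symbol that occurs with the given probability."""
    return -math.log2(probability)


def zero_order_entropy(data):
    """Estimate the zero-order entropy of a sequence, in bits per symbol.

    Assumption: each symbol probability is approximated by its relative
    frequency in the sequence itself.
    """
    counts = Counter(data)
    total = len(data)
    # Sum -P * log2(P) over the distinct symbols that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "abracadabra"
    print(zero_order_entropy(sample))
```

A source whose symbols are all equally likely gives the maximum value, log2 of the number of unique symbols, while a source that repeats a single symbol gives zero.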
D. A. Huffman, in "A Method for the Construction of Minimum Redundancy Codes," Proceedings of the I.R.E., Vol. 40, 1952, pp. 1098-1101, suggested mapping variable-length codes onto the source symbols in accordance with the statistical frequency distribution of the symbols to provide a discrete approximation of such ideal encoding. Thereafter, arithmetic coding techniques were developed to further optimize the encoding by arithmetically modifying the last few bits of the previously encoded symbols, thereby avoiding the waste of fractional bits. See, for example, R. C. Pasco, "Source Coding Algorithms for Fast Data Compression," Ph. D. Dissertation, Stanford University, 1976; G. G. Langdon, Jr. et al., "Compression of Black-White Images with Arithmetic Coding," IEEE Transactions on Communications, Com-29, No. 6, 1981, pp. 858-867; G. G. Langdon, Jr. et al., "A Double Adaptive File Compression Algorithm," IEEE Transactions on Communications, Com-31, No. 11, 1983, pp. 1253-1255; and J. Rissanen et al., "Universal Modeling and Coding," IEEE Transactions on Information Theory, IT-27, No. 1, 1981, pp. 12-23.
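The "discrete approximation" provided by Huffman coding can be seen by comparing whole-bit code lengths with the fractional ideal lengths. The sketch below is a conventional heap-based construction written for illustration only; the function name `huffman_code_lengths` and the tie-breaking order are assumptions, and it computes code lengths rather than the codes themselves.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code built from data."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {symbol: 1 for symbol in counts}
    # Heap entries are (weight, unique tiebreaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    print(huffman_code_lengths("aaaabbc"))
```

For the sample string "aaaabbc", the symbol "a" has probability 4/7, so its ideal length is about 0.81 bit, yet the Huffman code must spend a whole bit on it. That rounding is the fractional-bit waste that arithmetic coding is intended to avoid.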
As a practical matter, however, the zero-order entropy model of equation (1) fails to capture a significant part of the redundancy of many conventional sources. For example, English language text normally exhibits a substantial drop in first-order entropy:

$$H_1 = -\sum_{i,j=1}^{n} P(x = c_i)\, P(y = c_j \mid x = c_i) \log_2 P(y = c_j \mid x = c_i) \qquad (2)$$

where: $xy$ is a randomly chosen pair of adjacent source characters, and $c_i$ and $c_j$ each range over all possible source symbols.
Thus, several of the above-identified references have extended Huffman and arithmetic coding techniques by basing the coding on statistics that are conditioned by the frequency at which any given symbol is preceded by at least one and usually two or more other symbols. Unfortunately, however, the increased compression that is achieved in that way characteristically requires substantially more memory and processing time to carry out the compression process.
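The drop from zero-order to first-order entropy described above can be estimated by conditioning each symbol on its immediate predecessor. The following Python sketch is illustrative only: the name `first_order_entropy` is hypothetical, and the probabilities are assumed to be the relative frequencies of adjacent pairs in the sample.

```python
import math
from collections import Counter


def first_order_entropy(data):
    """Estimate the first-order entropy of a sequence, in bits per symbol.

    Assumption: P(x) * P(y|x) is approximated by the relative frequency of
    the adjacent pair xy, and P(y|x) by how often y follows x.
    """
    total_pairs = len(data) - 1
    if total_pairs <= 0:
        return 0.0
    pair_counts = Counter(zip(data, data[1:]))
    lead_counts = Counter(data[:-1])  # how often each symbol starts a pair
    entropy = 0.0
    for (x, _y), n_xy in pair_counts.items():
        p_pair = n_xy / total_pairs          # estimate of P(x) * P(y|x)
        p_y_given_x = n_xy / lead_counts[x]  # estimate of P(y|x)
        entropy -= p_pair * math.log2(p_y_given_x)
    return entropy


if __name__ == "__main__":
    print(first_order_entropy("abababab"))
```

A strictly alternating string such as "abababab" has a zero-order entropy of one bit per symbol but a first-order entropy of zero, because each symbol is fully determined by its predecessor. The table of pair counts also hints at the cost noted above: its size grows with the square of the alphabet for one symbol of context, and faster for longer contexts.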
Others have proposed so-called "textual substitution" data compression processes for capturing the high-order coherence of text and similar source data, without having to pre-condition the capture mechanism on statistical probabilities. J. Ziv and A. Lempel proposed an algorithmic model for a textual substitution process based on the notion that a reoccurrence of a string of previously encoded symbols can be represented by prefacing the symbol immediately following such a reoccurrence (i.e., the suffix character of the reoccurring string) with a copy codeword which (1) points to one end (e.g., the lead end) of the prior occurrence of the string and (2) identifies the length of the reoccurring string. They recognized that such a copy codeword would fully and completely define the reoccurring symbol string, so they envisioned "substituting" the codeword for the symbols of the reoccurring symbol string to provide a compressed representation of them. See J. Ziv et al., "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, IT-23, No. 3, 1977, pp. 337-343.
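The copy-codeword idea described above can be sketched as follows. This is a deliberately simple, brute-force illustration and not the algorithm of any cited reference: the names `lz77_encode` and `lz77_decode`, the `window` parameter, and the choice to express the pointer as a distance back from the current position are all assumptions made for the example.

```python
def lz77_encode(data, window=4096):
    """Greedily parse a string into (distance, length, suffix) triples.

    Each triple stands for: copy `length` symbols starting `distance`
    symbols back, then emit the literal `suffix` symbol.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Always leave one symbol unmatched to serve as the suffix.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        tokens.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the original string from (distance, length, suffix) triples."""
    out = []
    for distance, length, suffix in tokens:
        for _ in range(length):
            # Copy one symbol at a time so a copy may overlap its own output.
            out.append(out[-distance])
        out.append(suffix)
    return "".join(out)


if __name__ == "__main__":
    tokens = lz77_encode("abcabcabcd")
    print(tokens)
    print(lz77_decode(tokens))
```

The exhaustive search over every earlier position is what makes a naive implementation slow, which is why practical systems restrict the parse or add search data structures.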
Regrettably, compression systems based on the original Ziv-Lempel algorithm tend to be unacceptably slow and have not achieved particularly high compression. To improve speed, Ziv, Lempel, and other workers in the field have developed various alternatives. Some of these alternatives adopted artificially rigid parsing mechanisms to simplify the generation of the copy codewords and to limit the size of the data structures that are needed to carry out the compression. See, for example, J. Ziv et al., supra; J. Ziv, "Coding Theorems for Individual Sequences," IEEE Transactions on Information Theory, IT-24, No. 4, 1978, pp. 405-412; J. Ziv et al., "Compression of Individual Sequences Via Variable-Rate Coding," IEEE Transactions on Information Theory, IT-24, No. 5, 1978, pp. 530-536; and W. L. Eastman et al., U.S. Pat. No. 4,464,650, which issued Aug. 7, 1984, on "Apparatus and Method for Compressing Data Signals and Restoring the Compressed Data." However, these modified Ziv-Lempel style data compression systems have not been fully satisfactory because their performance typically is disappointing as measured by the speed at which they adapt and/or the compression they provide.
A somewhat different approach that has been proposed for utilizing textual substitution for data compression is based upon building adaptive lists or dictionaries of individual symbols and symbol strings. See, for example, V. S. Miller et al., "Variations on a Theme by Ziv and Lempel," IBM Research Report, RC 10630, #47798, 1984, Combinatorial Algorithms on Words, NATO, ASI Series F, Vol. 12, 1985, pp. 131-140; T. A. Welch, "A Technique for High Performance Data Compression," IEEE Computer, Vol. 17, No. 6, 1984, pp. 8-19; and J. L. Bentley et al., "A Locally Adaptive Data Compression Scheme," Communications of the ACM, Vol. 29, No. 4, 1986, pp. 320-330. However, these systems are slow to adapt and achieve inferior compression.
Still further background on the textual substitution digital data compression art is provided, for example, by M. Rodeh et al., "Linear Algorithm for Data Compression Via String Matching," Journal of the Association for Computing Machinery, Vol. 28, No. 1, 1981, pp. 16-24; J. A. Storer, "Data Compression Via Textual Substitution," Journal of the Association for Computing Machinery, Vol. 29, No. 4, 1982, pp. 928-951; G. Guoan et al., "Using String Matching to Compress Chinese Characters," Stanford Technical Report, STAN-CS-82-914, 1982; and G. G. Langdon, Jr., "A Note on the Ziv-Lempel Model for Compressing Individual Sequences," IEEE Transactions on Information Theory, IT-29, No. 2, 1983, pp. 284-287.
In view of the disadvantages of the prior art, it will be apparent that there still is a need for practical general purpose, adaptive and invertible (i.e., lossless) data compression systems for reliably and efficiently compressing large sources having stable statistical characteristics, as well as less extensive sources and sources having variable statistical characteristics. Textual substitution techniques would be well suited to that task, but improved methods and means for carrying out such a data compression process in practice are needed to more fully realize its potential.