In a digital data storage or communication system, it is desirable to reduce the size of stored or transmitted data so that it requires less space. The space occupied by a body of digital data is most frequently measured in eight-bit bytes, where each bit is a fundamental unit of information representing a single binary decision. Although the invention is described for eight-bit bytes, the generalization to larger or smaller byte sizes will be obvious to those skilled in the art. The effectiveness of a compression method is measured by its compression ratio, the ratio of the number of bytes in the data's uncompressed representation to the number of bytes in its compressed representation.
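The compression ratio defined above can be illustrated with a minimal sketch. The function name `compression_ratio` and the byte counts used are hypothetical and chosen only for illustration.

```python
def compression_ratio(uncompressed: bytes, compressed: bytes) -> float:
    """Ratio of the uncompressed size to the compressed size, both in bytes."""
    if len(compressed) == 0:
        raise ValueError("compressed representation must not be empty")
    return len(uncompressed) / len(compressed)


# Hypothetical sizes: 1000 bytes before compression, 400 bytes after.
ratio = compression_ratio(bytes(1000), bytes(400))
print(ratio)  # 2.5
```

A ratio greater than 1 indicates that the compressed representation is smaller than the original.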
There are a multitude of methods for compressing data, but only a few are amenable to general data storage/transmission. A system for general data storage/transmission employs a method that satisfies four criteria: general-purpose, single-pass, adaptive, and lossless.
By general-purpose it is meant that the compression method operates effectively on any type of data, as opposed to a special-purpose method which performs well only on data of special content or in a special format. To meet the general-purpose criterion, some compression methods require a preliminary pass over the data for gathering statistics. These compression methods may be unusable in real-time data storage/transmission environments because of an unacceptable amount of delay; for real-time applications, a compression method usually must meet the single-pass criterion.
A compression method for data storage/transmission that is general-purpose and single-pass is almost always adaptive. An adaptive compression method is one which conforms to the particular redundant characteristics of the input data while it compresses the data.
Finally, a lossless compression method is desired, that is, a method with which the decompressed data is an exact replica of the original data before compression, not just a close facsimile. Compression methods which are not lossless discard fine details, which are sometimes unimportant when the data is decompressed. The determination of what constitutes an unimportant detail relies upon some prior knowledge of the data and clearly is inappropriate for general-purpose data compression.
The compression methods which satisfy the four criteria fall into two categories: statistical and substitutional. Prior art statistical compression methods include adaptive versions of the Huffman code and the arithmetic code. A description of Huffman codes is found in "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, vol. 40, Sept. 1952, pp. 1098-1101. A description of arithmetic codes is found in "An Introduction to Arithmetic Coding," IBM Journal of Research and Development, vol. 28, no. 2, pp. 135-149. Both Huffman and arithmetic codes operate by calculating, if only in effect, the probability of each received byte (or bit) in accordance with some probabilistic model of the input data. The probabilistic model is typically a Markov model. An order N Markov model examines the context, consisting of the previous N-1 bytes, and estimates the likelihood of the byte in question. To achieve a high compression ratio, N must be large. The problem with statistical data compression methods is that with increasing N, the amount of working memory required to store the model grows exponentially. The amount of working memory required for the compression ratios desired in a data storage/transmission system is prohibitively large for many applications.
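The exponential growth described above can be made concrete with a rough sketch. It assumes a dense table holding one counter for every (context, byte) pair, with the context taken as the previous N-1 eight-bit bytes as defined above. The function name `dense_model_entries` and the dense-table layout are assumptions made for illustration; a model that stores only the contexts actually observed would need less memory in practice.

```python
def dense_model_entries(order_n: int, alphabet_size: int = 256) -> int:
    """Number of counters in a dense order-N table (assumed layout).

    The context is the previous N-1 bytes, so there are
    alphabet_size ** (N - 1) contexts, each needing one counter
    per possible next byte.
    """
    contexts = alphabet_size ** (order_n - 1)
    return contexts * alphabet_size


for n in range(1, 5):
    print(n, dense_model_entries(n))
# 1 256
# 2 65536
# 3 16777216
# 4 4294967296
```

Under these assumptions each increment of N multiplies the table size by 256.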
Substitutional compression methods achieve a compact representation of data by replacing input bytes or strings of input bytes with special tokens. A token is a sequence of bits, typically of variable length, which a compressor puts into the compressed output data as a replacement or substitution for one or more uncompressed input bytes. The decompressor must be able to recognize these tokens and parse them from the compressed data as it decompresses. These tokens instruct the decompressor as to how to reconstruct the original, uncompressed data. Run-length coding is perhaps the best known example of a substitutional data compression method. In run-length coding, runs of identical bytes are replaced with an identifier of the byte and a count of the number of times the byte is repeated. Although a good example of substitutional data compression, run-length coding is suited only for special types of data and not for general data transmission/storage. A more sophisticated substitutional method which satisfies the four criteria is LZ1, also known as LZ77.
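The run-length coding described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: tokens are represented as (byte value, count) pairs in a Python list instead of a packed bit sequence, and the names `rle_encode` and `rle_decode` are hypothetical.

```python
from typing import List, Tuple


def rle_encode(data: bytes) -> List[Tuple[int, int]]:
    """Replace each run of identical bytes with a (byte value, count) pair."""
    runs: List[Tuple[int, int]] = []
    for value in data:
        if runs and runs[-1][0] == value:
            # Same byte as the current run: extend the run's count.
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            # A different byte starts a new run.
            runs.append((value, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> bytes:
    """Rebuild the original bytes by repeating each byte 'count' times."""
    return b"".join(bytes([value]) * count for value, count in runs)


original = b"AAAABBBCCD"
encoded = rle_encode(original)
print(encoded)  # [(65, 4), (66, 3), (67, 2), (68, 1)]
assert rle_decode(encoded) == original  # lossless round trip
```

Input without runs, such as `b"ABCD"`, produces one pair per byte and therefore grows in size, which illustrates why run-length coding is suited only to special types of data.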