I. Field of the Invention
This invention relates to compression of data.
II. Related Art
It is highly desirable to compress data so that it can be efficiently stored and transmitted. Valuable bandwidth can be preserved and communication channels can be more efficiently used if the size of the data is reduced. Similarly, less memory is required to store compressed data than uncompressed data. Various techniques such as run length encoding (for example, Ziv-Lempel and PKZIP), Huffman compression, and arithmetic coding can be used to compress data in such a way that data is not lost. These lossless techniques can be performed in conjunction with other algorithms that enhance compression, such as the Burrows-Wheeler transform.
A simple variant of run length encoding involves identifying one or more strings of data that are frequently repeated, such as the word “the”. Such frequently repeated data strings can be encoded using a coding element that is substantially shorter than the string itself. This technique and variants thereof can achieve up to approximately 4:1 compression of English text. More complex variants of run length encoding are also in common use. A major drawback to run length encoding is that the strings of data that are frequently repeated are not always known a priori, thus requiring the use of a predetermined set of codes for a set of predetermined repetitive symbols. It may not be possible to achieve the desired degree of compression if the repetitive strings in the data do not match those included in the predetermined set.
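The string substitution described above can be illustrated with a minimal Python sketch. The code table, the one-byte code values, and the function names are assumptions made for illustration only; the sketch further assumes that the code values never occur in the input text.

```python
# Hypothetical predetermined code table: each frequent string maps to a
# one-character code that is assumed never to occur in the input text.
CODE_TABLE = {"the": "\x01", "and": "\x02", "ing": "\x03"}


def encode(text, table=CODE_TABLE):
    """Replace every occurrence of a table string with its short code."""
    for string, code in table.items():
        text = text.replace(string, code)
    return text


def decode(data, table=CODE_TABLE):
    """Invert encode() by expanding each code back into its string."""
    for string, code in table.items():
        data = data.replace(code, string)
    return data
```

Text that contains none of the predetermined strings passes through encode() at its original length, which is the drawback noted above.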
Huffman coding, or variants thereof, is used in a variety of instances, ranging from Morse code to the UNIX pack/unpack and compress/uncompress commands. Huffman coding and variants of Huffman coding involve determining the relative frequency of characters and assigning a code based upon that particular frequency. Characters that recur frequently have shorter codes than characters that occur less frequently. Binary tree structures are generated, preferably starting at the bottom with the longest codes, and working to the top and ending with the shortest codes. Although preferably built from the bottom up, these trees are actually read from the top down, as the decoder takes a bit-encoded message and traces the branches of the tree downward. In this way, the most frequently encountered characters are encountered first. One of the drawbacks to Huffman coding is that the probabilities assigned to characters are not known a priori. Generally, the Huffman binary tree is generated using pre-established frequencies that may or may not apply to a particular data set.
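The bottom-up construction and top-down decoding described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the frequencies are taken from the input itself rather than from a pre-established table, and the input is assumed to be non-empty.

```python
import heapq
from collections import Counter


def build_codes(text):
    """Build a Huffman code table and tree from character frequencies."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, tree). A tree is either a
    # single character (leaf) or a (left, right) tuple (internal node).
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    # Bottom-up: repeatedly merge the two least frequent subtrees.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    root = heap[0][2]
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # input with one distinct symbol

    walk(root, "")
    return codes, root


def decode(bits, root):
    """Top-down: follow one branch per bit until a leaf is reached."""
    if not isinstance(root, tuple):  # degenerate single-symbol tree
        return root * len(bits)
    out, node = [], root
    for b in bits:
        node = node[int(b)]
        if not isinstance(node, tuple):
            out.append(node)
            node = root
    return "".join(out)
```

A message is encoded by concatenating codes[c] for each character c; the decoder needs the same tree, which is why frequencies that do not fit the data reduce the achievable compression.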
Arithmetic coding is also used in a variety of circumstances. Generally, compression ratios achieved using arithmetic coding are higher than those achieved using Huffman coding when the probabilities of data elements are more arbitrary. Like Huffman coding, arithmetic coding is a lossless technique based upon the probability of a data element. However, unlike Huffman coding, arithmetic coding produces a single symbol rather than several separate code words. Data is encoded as a real number in the interval from zero to one (as opposed to a whole number). Unfortunately, arithmetic coding presents a variety of drawbacks. First, arithmetic coding is generally much slower than other techniques. This is especially serious when arithmetic coding is used in conjunction with high-order predictive coding methods. Second, because arithmetic coding more faithfully reflects the probability distribution used in an encoding process, inaccurate or incorrect modeling of the symbol probabilities may lead to poorer performance.
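The encoding of a whole message as a single number can be illustrated with the following Python sketch. It assumes a fixed, known probability model supplied as exact fractions that sum to one, and it uses exact rational arithmetic for clarity; practical coders typically use fixed-precision integer arithmetic instead. All function names are hypothetical.

```python
from fractions import Fraction


def build_model(probs):
    """Map each symbol to its [low, high) slice of the interval [0, 1)."""
    ranges, low = {}, Fraction(0)
    for sym, p in probs.items():
        ranges[sym] = (low, low + p)
        low += p
    return ranges


def encode(message, ranges):
    """Narrow [low, high) once per symbol; return one number inside it."""
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        span = high - low
        s_low, s_high = ranges[sym]
        low, high = low + span * s_low, low + span * s_high
    return (low + high) / 2


def decode(value, length, ranges):
    """Recover `length` symbols by locating `value` in successive slices."""
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in ranges.items():
            if s_low <= value < s_high:
                out.append(sym)
                # Rescale so the chosen slice becomes [0, 1) again.
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)
```

Because the final interval width equals the product of the symbol probabilities, a model that assigns low probability to symbols that actually occur yields a narrow interval and therefore a longer encoded number.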
Adaptive statistics provides a technique for dealing with some of the drawbacks involving prior knowledge of a symbol set. In general, adaptive encoding algorithms provide a way to encode symbols that are not present in a table of symbols or a table of prefixes. If an unknown symbol is detected, an escape code (ESC value) is issued and entered into the coded stream. The encoder continues the encoding process with a lower order prefix, adding additional data to the encoded bit stream. The lowest order prediction table (often an order 0 table) must contain all possible symbols so that every possible symbol can be found in it. The ESC code must be encoded using a probability. However, because of the unpredictable nature of new symbols, the probability of the ESC code cannot be accurately estimated from preceding data. Often, the probability of the ESC value for a given prefix is empirically determined, leading to non-optimal efficiency. Thus, introduction of an ESC code in adaptive encoding algorithms raises two problems. First, the ESC code gives only limited information about the new symbol; the new symbol still has to be encoded using a lower order prefix prediction table. Second, the probability of the ESC code cannot be accurately modeled.
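The escape mechanism described above can be sketched in Python as follows. The sketch uses a single order 1 table (one preceding character as the prefix) that falls back to an order 0 table preloaded with all 256 single-byte symbols. It lists the coding events and their ideal cost in bits rather than producing an actual bit stream. The ESC marker, the fixed ESC count of 1, and the function name are assumptions made for illustration; the fixed ESC count stands in for the empirically chosen probability discussed above. Input characters are assumed to have code points below 256.

```python
import math
from collections import Counter, defaultdict

ESC = "<ESC>"   # hypothetical marker for the escape event
ESC_COUNT = 1   # assumed, empirically chosen weight of the ESC event


def encode_events(text):
    """Return (events, bits): the coding events and their ideal cost."""
    order1 = defaultdict(Counter)  # prefix char -> counts of next char
    # Order 0 table holds every possible symbol, so lookup never fails.
    order0 = Counter({chr(i): 1 for i in range(256)})
    events, bits, prev = [], 0.0, None

    for ch in text:
        context = order1[prev]
        total = sum(context.values()) + ESC_COUNT
        if ch in context:
            events.append(("order1", ch))
            bits -= math.log2(context[ch] / total)
        else:
            # Unknown symbol for this prefix: emit ESC, then code the
            # symbol itself with the lower order (order 0) table.
            events.append(("order1", ESC))
            bits -= math.log2(ESC_COUNT / total)
            events.append(("order0", ch))
            bits -= math.log2(order0[ch] / sum(order0.values()))
        # Adaptive update after coding, so a decoder can mirror it.
        context[ch] += 1
        order0[ch] += 1
        prev = ch
    return events, bits
```

Each novel symbol costs two events, the ESC and the order 0 code, which reflects the first problem noted above; the cost of the ESC event itself depends entirely on the assumed ESC_COUNT, which reflects the second.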
Accordingly, it would be advantageous to provide a technique for lossless compression that does not suffer from the drawbacks of the prior art.