1. Field of the Invention
The invention relates to data compression. More particularly, the invention relates to dictionary methods and apparatus that perform lossless data compression.
2. Description of the Related Art
Lossless data compression relates to a category of data compression methods in which the recreated or reproduced (decompressed) data is an exact replication of the original data. Lossless data compression is contrasted with lossy data compression, in which the recreated data is different from the original data, i.e., there is some distortion between the original data and the recreated data.
Lossless data compression can be broken down into four categories: defined word compressors, the algebraic compressor, context-aware compressors, and dictionary compressors. Defined word compressors operate by attempting to find an optimal mapping between messages and codewords such that the number of symbols in each codeword matches the information content of the message. The algebraic compressor is a distinct compression algorithm that operates by calculating a single unique number (represented by an arbitrarily long bit sequence) based on the probabilities of the individual messages. Context-aware compressors operate by taking advantage of previously obtained or derived knowledge of an ensemble to represent the ensemble in a more compact form.
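As an illustration of a defined word compressor, the following minimal sketch builds a Huffman code, which is one well-known scheme of this kind. Huffman coding is not named in the text above and is used here only as an assumption for illustration; the function name `huffman_code_lengths` and the example probabilities are likewise hypothetical. With probabilities that are powers of one half, each codeword length equals the information content of its message, -log2(p), in bits.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {message: codeword length in bits} for a Huffman code.

    probs maps each message to its probability. Illustrative sketch only.
    """
    # Heap entries: (probability, tie_breaker, {message: depth so far}).
    # The integer tie_breaker keeps the dicts from ever being compared.
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every message inside
        # them moves one level deeper, i.e. gains one codeword bit.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


# Hypothetical example ensemble statistics.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
info = {s: -math.log2(p) for s, p in probs.items()}
print(lengths)  # codeword lengths in bits
print(info)     # information content in bits
```

For probabilities that are not powers of one half, the codeword lengths can only approximate the information content, since each codeword must contain a whole number of symbols.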
Dictionary compressors operate by combining groups of messages together into new messages to create a new ensemble with higher information entropy and shorter length. That is, as a bit stream is read, a collection of bit patterns encountered in the bit stream (a “dictionary”) is compiled. When a previously encountered bit pattern is seen in the bit stream, a dictionary code identifying the dictionary entry that corresponds to the bit pattern is substituted in the bit stream in place of the bit pattern itself. The dictionary code usually is represented by fewer bits than the bit pattern that the dictionary code identifies. Thus, a significant saving in storage space or transmission time can be realized, especially in a binary image where repetition of bit patterns occurs frequently.
Dictionary compressors typically fall into two classes: those based on the LZ77 (Lempel-Ziv) compression algorithm and those based on the LZ78 compression algorithm. The LZ77 compression algorithm operates by examining messages one by one, locating identical sequences of messages backwards in time in the ensemble. When a match is found, a new message is inserted into a compressed ensemble in place of the repeating messages. The new message indicates the distance or offset backwards in the compressed ensemble as well as the number of messages that have been found to repeat (the length).
Depending on the application, the LZ77 compression algorithm can have a number of drawbacks. For example, each newly added message in the compressed ensemble requires two pieces of information: a distance (or offset) and a length. Also, the compressor and decompressor must search backwards through the compressed sequence to locate cases where the pattern repeats. Such searching requires that the compressor and decompressor maintain an image of the entire compressed sequence up to the last received message in the compressed ensemble. Typically, such an image is not maintained; instead, a sliding window (e.g., 4 k in length) is maintained for both the compressor and decompressor.
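The LZ77 scheme and its sliding window can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `lz77_compress`, `lz77_decompress`, and `MIN_MATCH` are hypothetical, the window default of 4096 simply mirrors the "4 k" example, matches are found by brute-force search, and offsets are measured backwards in the uncompressed data, as is common in practical implementations.

```python
MIN_MATCH = 2  # assumed threshold: shorter repeats are emitted as literals


def lz77_compress(data, window=4096):
    """Return a list of tokens: literal characters or (offset, length) pairs."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search backwards, but only within the sliding window.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy).
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= MIN_MATCH:
            out.append((best_off, best_len))  # two pieces: offset and length
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out


def lz77_decompress(tokens):
    """Rebuild the original string from the token list."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # copy one message from `offset` back
        else:
            out.append(token)
    return "".join(out)


tokens = lz77_compress("abcabcabcabc")
print(tokens)
print(lz77_decompress(tokens))
```

The inner loop makes the search cost visible: every position in the window is a candidate match start, which is why the window length bounds the compressor's work and memory. In this sketch the decompressor needs no search, because each token already carries the offset; it only needs the most recent window of output to copy from.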
To overcome these issues, the LZ78 compression algorithm was proposed. The LZ78 compression algorithm maintains a dictionary of previously seen sequences of messages in the original ensemble. As the compressor walks through the ensemble, the ensemble is broken down into distinct sequences made up of an already seen sequence of messages followed by the first message that would make the sequence non-repeating. The resulting compressed sequence is represented by tuples made up of an index into the dictionary for the repeating part of the sequence followed by the message that makes the sequence non-repeating.
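The tuple structure described above can be sketched as follows. The function names and the convention that index 0 denotes the empty sequence are assumptions made for this illustration.

```python
def lz78_compress(data):
    """Return a list of (dictionary index, next message) tuples."""
    dictionary = {"": 0}  # index 0 is the empty sequence (assumed convention)
    out = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            # Still inside an already seen sequence; keep extending it.
            phrase += ch
        else:
            # ch is the first message that makes the sequence non-repeating.
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known sequence; flush it with no extra message.
        out.append((dictionary[phrase], ""))
    return out


def lz78_decompress(tuples):
    """Rebuild the original string, reconstructing the dictionary on the fly."""
    phrases = [""]
    out = []
    for index, ch in tuples:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_compress("abababa")
print(pairs)
print(lz78_decompress(pairs))
```

For the input "abababa", the ensemble is broken into the sequences "a", "b", "ab", and "aba", each encoded as an index for its repeating prefix plus one new message.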
An improvement to the LZ78 compression algorithm, called LZW (Lempel-Ziv-Welch), subsequently was proposed. The LZW compression algorithm varies from the LZ78 compression algorithm in that the dictionary is preloaded with all the messages in the alphabet associated with the ensemble. The compressor and decompressor can then infer the dictionary entries based on the previous entries in the original ensemble, and therefore do not need to include both the dictionary entry and the next unique message in the output ensemble.
Because the LZW compression algorithm is a modification of the LZ78 compression algorithm, both compression algorithms store entries in the dictionary in the same form: 1) the input messages that made the previous output message non-repeating and 2) the new input message. The primary difference between the two is that the LZW compression algorithm can infer the dictionary from the input sequence during compression and from the compressed sequence during decompression.
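A minimal sketch of LZW follows, showing the preloaded dictionary and how the decompressor infers each new entry from the codes alone. The function names and the `alphabet` parameter are assumptions for illustration, and the sketch assumes every message in the input belongs to the supplied alphabet.

```python
def lzw_compress(data, alphabet):
    """Return a list of dictionary indices; no literal messages are emitted."""
    # Preload the dictionary with every message in the alphabet.
    dictionary = {s: i for i, s in enumerate(alphabet)}
    out = []
    phrase = ""
    for ch in data:  # assumes each ch is present in `alphabet`
        if phrase + ch in dictionary:
            phrase += ch
        else:
            out.append(dictionary[phrase])
            dictionary[phrase + ch] = len(dictionary)
            phrase = ch  # the breaking message starts the next phrase
    if phrase:
        out.append(dictionary[phrase])
    return out


def lzw_decompress(codes, alphabet):
    """Rebuild the original string, inferring dictionary entries as it goes."""
    phrases = list(alphabet)
    if not codes:
        return ""
    prev = phrases[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code < len(phrases):
            entry = phrases[code]
        else:
            # The code refers to the entry being built right now, so it
            # must be the previous phrase plus its own first message.
            entry = prev + prev[0]
        # Mirror the compressor: previous phrase + first message of this one.
        phrases.append(prev + entry[0])
        out.append(entry)
        prev = entry
    return "".join(out)


codes = lzw_compress("abababa", "ab")
print(codes)
print(lzw_decompress(codes, "ab"))
```

Compared with the LZ78 tuples, each output element here is a single index. The message that breaks a repeat is not transmitted; it is carried as the first message of the following phrase, which is what lets the decompressor rebuild the same dictionary.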
Despite the development of the LZW compression algorithm, there is a need for an output-driven dictionary compression method that has many of the traditional features of the LZW compression algorithm, but, unlike the LZW compression algorithm, is not based on either the LZ77 compression algorithm or the LZ78 compression algorithm.