The term data compression refers generally to the process of transforming a set of data to a smaller representation from which the original or some approximation of the original can be extracted at a later time. A data compression process generally includes the steps of encoding a body of data into a smaller body of data and the step of decoding the compressed data back into the original body of data or an acceptable approximation thereof. One application for data compression is for textual data such as information in books, programming language source or object code, database information, numerical representations, etc. In these types of data, it is generally important to preserve the original data exactly by what is referred to as lossless data compression. While there are numerous applications for data compression, two of the most common applications include data storage and data communications.
In the design of a data compression system, there is a tradeoff between the benefits of obtaining compressed data versus the computational costs incurred by performing the encoding and decoding. Generally, compression is judged in terms of a compression ratio, which relates input character size to the output size. As computer technology advances, data compression is becoming a standard component of communication and data storage systems. A common computer configuration includes a data encoding/decoding chip at the ends of a communication channel. Alternatively, data compression software can be executed during the communication transmit and receive processes.
A widely used textual substitution data compressor is based on the Lempel and Ziv scheme. This scheme is described in detail in J. Ziv and A. Lempel, "Compression of Individual Sequences Via Variable Rate Coding," IEEE Transactions on Information Theory, IT-24(5):530-536, September 1978. A more practical reduction of the Lempel and Ziv scheme was developed by Welch; this practical scheme is described in T. A. Welch, "A Technique for High-Performance Data Compression," Computer, 8-19 (June 1984), and is the subject of U.S. Pat. No. 4,558,302. Generally, the scheme parses an input string to be compressed in accordance with a string table or dictionary that includes a list of previously parsed input strings. Thus, the string table is based on the data in the input string so that certain compressor characteristics are dependent upon the particular string being compressed. The general operation of the Lempel-Ziv-Welch (LZW) encoder is described in conjunction with prior art FIGS. 1 and 2.
With reference to FIGS. 1a and 1b, the parsing of string abababaabacabacb by an LZW data compression algorithm is represented by a string table and a string addition table, respectively. The string is parsed over the alphabet Σ={a, b, c}, and the dictionary size is unbounded. This alphabet is merely for descriptive purposes. An example of a realistic alphabet is the set of ASCII characters. The present invention improves on the LZW scheme by controlling the dictionary size and utilization.
The LZW scheme is organized around a string table that maps strings of input characters into fixed length codes. The string table has the property that for every string or word in the table its prefix strings are also stored in the table. For example, if the string ωK, composed of a string ω and a single extension character K, is in the table, then the prefix ω is in the table, as are all prefixes of ω.
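The prefix property can be stated as a small check. The following is a minimal sketch, assuming the string table is represented as a plain Python set of strings; the function name `is_prefix_closed` is hypothetical and is used only for illustration.

```python
def is_prefix_closed(table):
    """Return True if every non-empty proper prefix of every word is also in the table."""
    return all(
        word[:i] in table
        for word in table
        for i in range(1, len(word))  # proper prefixes of length 1 .. len(word)-1
    )


# A table holding a, b, ab and aba satisfies the property;
# a table holding aba without ab does not.
print(is_prefix_closed({"a", "b", "ab", "aba"}))
print(is_prefix_closed({"a", "aba"}))
```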
The string table is initialized to the one letter strings over the alphabet {a, b, c}; each of these strings is assigned an output code {0, 1, 2}, respectively. The input character size is assumed to be 8 bits. The input string is analyzed character serially, in this case left to right, and the longest matched input string is parsed off at each pass and its code is output. The set of codes represents the compressed data.
With reference to FIG. 2, an encoding pass begins at block 10 by obtaining an extension character from the string. The extension character is always the first unmatched character from the input string. For the initialization, the extension character is the first character in the input string. At block 12, a check is made to determine whether the entire input string has been encoded. If input is still available, at block 14 the extension character is appended to the previously matched string, i.e., the prefix string, if there is one, and the new string is matched against the strings in the string table. At block 16, when the new string is matched, the new string becomes the prefix string and the process returns to block 10.
If, at block 14, no match is found, at block 18, the new string is added to the string table at an unused code address. The code related to the last matched string is output. The process returns to block 10. If the end of the file is reached, at block 20, the code for the prefix is output.
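The encoding pass described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular LZW product: it assumes an unbounded dictionary held in a Python dict, codes assigned from 0 in alphabet order, and an input that contains only characters of the alphabet. The function name `lzw_encode` is hypothetical; the block numbers in the comments refer to FIG. 2 as described in the text.

```python
def lzw_encode(text, alphabet):
    """Return (codes, table) for an LZW parse of `text` with an unbounded dictionary."""
    # String table initialised with the one-letter strings (codes 0, 1, 2, ...).
    table = {ch: code for code, ch in enumerate(alphabet)}
    codes = []
    prefix = ""                            # previously matched string
    for ch in text:                        # block 10: obtain the extension character
        candidate = prefix + ch            # block 14: append it to the prefix
        if candidate in table:             # block 16: match, candidate becomes the prefix
            prefix = candidate
        else:                              # block 18: no match
            table[candidate] = len(table)  # add new string at the next unused code
            codes.append(table[prefix])    # output the code of the last matched string
            prefix = ch                    # unmatched character starts the next pass
    if prefix:                             # block 20: end of input, output the prefix code
        codes.append(table[prefix])
    return codes, table


codes, table = lzw_encode("abababaabacabacb", "abc")
print(codes)  # [0, 1, 3, 5, 5, 2, 7, 1]
```

With this numbering, the strings added during the parse are ab, ba, aba, abaa, abac, ca, and abacb, at codes 3 through 9.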
In the example shown in FIG. 1, the string table is initialized with the characters of the alphabet associated with codes 0-2. Assuming that the first 3 entries of the string table are completed, a parsing example using the string abababaabacabacb is described. At the beginning of the first iteration a is obtained from the input string and matched against the string entries. Since a matches the first entry in the string table, the extension character b is obtained from the input string and appended to the prefix a to form string ab. An attempt is then made to match ab in the string table. Since no match is made, the string ab is added to the string table at position 3 and is assigned the new code 3. The code value 0, for the matched string a, is then output.
Since the last match was not successful, the next iteration begins using the extension character b, i.e., the first unmatched character in the input string, as the prefix. The string b matches the b entry in the table. Since the match was successful, the next extension character a, i.e., the third character in the input string, is read from the input string and appended to prefix b. A match of string ba is attempted and fails. When the match fails, an entry for ba is added to the string table and assigned output code 4. Code 1, related to matched string b, is output and a new iteration begins with the unmatched extension character a as a prefix. This process continues until the entire input string is parsed.
Decompression is performed by reconstructing the encoding string table. For example, the decoder will receive the codes 0, 1, 3, 5, 5, . . . The decoder reconstructs the string table starting from an initial table consisting of the known single character entries {a, b, c}. For example, after output codes 0 and 1 are decoded as a and b, the decoder enters ab into the string table with code 3, the next available code. Output code 3 is then received and decoded as ab, and the entire decoded string is abab. The decoder identifies the next string table entry as ba since b was the last matched string and a is its extension character. The strings produced by the decoding method are backwards since they are decoded from end to beginning, i.e., right to left. Thus, some type of string buffering and reversal are performed. As the string is decoded, each symbol is buffered. When the code is completely decoded, the buffer contents are output in reverse order.
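The table reconstruction and the buffer reversal can be illustrated as follows. This is a sketch under stated assumptions: each table entry is stored as a (prefix code, last character) pair, codes start at 0, and the dictionary is unbounded. The name `lzw_decode` is hypothetical. The sketch also handles a case the text does not spell out: a received code may refer to the entry the decoder is about to create, in which case the new string is the previous string extended by its own first character.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the string table from the code stream and return the decoded text."""
    # Each entry is (prefix_code, last_char); single characters have no prefix.
    table = [(None, ch) for ch in alphabet]

    def expand(code):
        # Walk prefix links from the end of the string to its beginning,
        # buffering each symbol, then output the buffer in reverse order.
        buf = []
        while code is not None:
            code, ch = table[code]
            buf.append(ch)
        buf.reverse()
        return "".join(buf)

    out = []
    prev = None
    for code in codes:
        if prev is None:                  # first code: nothing to add yet
            s = expand(code)
        elif code < len(table):           # code already known
            s = expand(code)
            table.append((prev, s[0]))    # previous string + first char of this one
        else:                             # code refers to the entry being created
            table.append((prev, expand(prev)[0]))
            s = expand(code)
        out.append(s)
        prev = code
    return "".join(out)


print(lzw_decode([0, 1, 3, 5, 5, 2, 7, 1], "abc"))  # abababaabacabacb
```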
In the example, the output codes are a fixed length. A common length is 12 bits. Thus, up to 4096 (2^12) different string table entries can be encoded. The dictionary growth mechanism defined by the addition of the last parsed word concatenated with the first unmatched character causes the dictionary to contain every prefix of every word it holds. For that reason, an LZW implementation may use a large amount of dictionary space. Furthermore, even if the characteristics of the data change after the beginning of the input string is encoded, the early word entries will remain in the string table.
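The arithmetic behind fixed length codes is simple enough to write out. The sketch below assumes that the compression ratio is taken as input bits divided by output bits, which is one common convention and not necessarily the one intended in the text; the function name `compression_ratio` is hypothetical. The example figures use the 16-character string parsed above, which yields 8 output codes when the parse is carried to completion.

```python
CHAR_BITS = 8    # input character size assumed in the text
CODE_BITS = 12   # common fixed output code length

# Number of distinct string table entries addressable with a 12-bit code.
max_entries = 2 ** CODE_BITS
print(max_entries)  # 4096


def compression_ratio(n_chars, n_codes, char_bits=CHAR_BITS, code_bits=CODE_BITS):
    """Input size in bits divided by output size in bits (assumed convention)."""
    return (n_chars * char_bits) / (n_codes * code_bits)


# 16 input characters encoded as 8 codes of 12 bits each: 128 bits in, 96 bits out.
print(compression_ratio(16, 8))
```

On such a short input the gain is small; the ratio improves as longer strings accumulate in the table.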
Since realistic implementations have finite memory, dictionary or string table growth must be bounded. Given a limited amount of dictionary space, it is desirable to establish a fully adaptable dictionary, that is, a dictionary that continues to change with the input stream, even after it fills. An adaptable dictionary generally provides better compression than a dictionary that reflects only the initial portion of the input string and therefore may not reflect changing patterns over the entire input stream.
A number of dictionary management schemes have been developed which attempt to ensure the adaptation of the dictionary for optimal compression. When designing and adapting a dictionary management scheme, compression success and memory management expense are balanced. It has been shown that without adaptation every doubling of the dictionary size usually results in an improvement in the compression ratio for sufficiently large input streams.
A dictionary management scheme referred to as FLUSH builds a dictionary of a given size and then starts completely over with dictionary building if the compression ratio falls below a predetermined limit. Optionally, the dictionary begins with 512 entries, i.e., a 9-bit output code. When the dictionary fills, its size doubles and the code length increases by one. This process continues until a maximum size is reached, e.g., 64K entries with 16-bit codes. Once the maximum size is reached, no further additions are made to the dictionary and only matched strings are encoded. The compression ratio is monitored and, if it drops below a predetermined threshold, the dictionary is flushed and returned to its original 512-word size. This process repeats until the entire input string has been encoded. The variable-sized dictionary aspect of the FLUSH methodology is advantageous when used for compressing small files. However, extra buffering overhead is required for the variable width output codes. This is a significant consideration since outputs must be handled bit serially.
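The growth and reset behaviour of FLUSH can be sketched as a small bookkeeping class. This is an illustration under assumptions, not a description of any specific implementation: the class name `FlushTable`, the helper `should_flush`, the 256-character initial alphabet, and the key format (prefix code, character) are all hypothetical, and the way the compression ratio is measured is left to the caller.

```python
class FlushTable:
    """Tracks dictionary size and code width for a FLUSH-style scheme (sketch)."""

    def __init__(self, alphabet_size=256, start_bits=9, max_bits=16):
        self.alphabet_size = alphabet_size  # codes reserved for single characters
        self.start_bits = start_bits        # 9 bits -> 512 entries
        self.max_bits = max_bits            # 16 bits -> 64K entries
        self.flush()

    def flush(self):
        """Discard all learned strings and return to the original size."""
        self.strings = {}                   # (prefix_code, char) -> code
        self.next_code = self.alphabet_size
        self.bits = self.start_bits

    @property
    def capacity(self):
        return 1 << self.bits

    def add(self, prefix_code, char):
        """Add a string; return False once the table is frozen at maximum size."""
        if self.next_code == self.capacity:
            if self.bits == self.max_bits:
                return False                # full: only matched strings are encoded
            self.bits += 1                  # size doubles, code length grows by one
        self.strings[(prefix_code, char)] = self.next_code
        self.next_code += 1
        return True


def should_flush(ratio, threshold):
    """Assumed policy: flush when the monitored ratio drops below the threshold."""
    return ratio < threshold
```

An encoder built on this sketch would emit each code using `bits` bits at the time of output, which is where the bit-serial buffering overhead mentioned above arises.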
A second method for dictionary management is the least recently used (LRU) methodology. This method is described in detail in V. Miller and M. Wegman, "Variations on a Theme by Ziv and Lempel," Combinatorial Algorithms on Words, Springer-Verlag, 1985, 131-140. Typically, an LRU implementation uses a linked list or queue of pointers to a set of tree nodes. The queue orders the tree nodes by recency of use. The LRU queue must be doubly linked to allow constant-time deletions. For constant-time searches, each tree node also keeps a pointer to its associated link in the LRU queue. Thus, four extra pointers per dictionary entry are used. This increases the amount of memory required to store the dictionary. Additionally, for every input character scanned, the location, removal, and insertion of a queue link is necessary. Thus, the computation requirements substantially exceed those of the FLUSH method.
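The recency queue can be illustrated with Python's `collections.OrderedDict`, which is swapped in here for the hand-built doubly linked queue and per-node back pointers described above; it offers the same constant-time move and removal operations. The class name `LRUQueue` and its methods are hypothetical. A complete LRU dictionary would also have to ensure that evicting an entry does not break the prefix property of the string table; that bookkeeping is omitted from this sketch.

```python
from collections import OrderedDict


class LRUQueue:
    """Orders dictionary codes by recency of use (sketch, oldest first)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.order = OrderedDict()      # code -> None, least recently used first

    def is_full(self):
        return len(self.order) >= self.capacity

    def touch(self, code):
        """Mark `code` as most recently used, inserting it if it is new."""
        self.order[code] = None
        self.order.move_to_end(code)    # constant-time relink to the tail

    def evict(self):
        """Remove and return the least recently used code."""
        code, _ = self.order.popitem(last=False)
        return code


q = LRUQueue(capacity=3)
for code in (3, 4, 5):
    q.touch(code)
q.touch(3)          # code 3 is used again, so code 4 is now the oldest
print(q.evict())    # 4
```

Each scanned input character triggers a `touch`, which is the per-character overhead the text attributes to LRU.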
A third dictionary management methodology is SWAP. In this scheme a primary and a secondary dictionary are maintained. When the primary dictionary fills, insertions are made in the secondary dictionary while the encoding continues based on the contents of the primary dictionary. Whenever the secondary dictionary fills, the roles of the dictionaries are swapped and the primary dictionary is reset, e.g., flushed and redefined, while the secondary dictionary is used for encoding. One variation of SWAP is a method that begins by filling the secondary dictionary once the primary dictionary is half full, and swaps the dictionaries whenever either dictionary fills. Both of the SWAP schemes require that memory and computation resources of an LZW compressor be doubled.