This invention relates generally to compression and decompression of digital data and more particularly to implementations of lossless compression and decompression methods and apparatus using a dictionary to store compression data, and applications of compression/decompression techniques to network packet communications.
A major class of compression schemes encodes multiple-character strings using binary sequences, or "codewords," not otherwise used to encode individual characters. The strings are composed from an "alphabet" of single-character strings. This alphabet represents the smallest unique piece of information the compressor processes. Thus, an algorithm that uses eight bits to represent its characters has 256 unique characters in its alphabet. Compression is effective to the degree that the multiple-character strings represented in the encoding scheme are encountered in a given file of the data stream. By analogy with bilingual dictionaries used to translate between human languages, the device that embodies the mapping between uncompressed code and compressed code is commonly referred to as a "dictionary."
Generally, the usefulness of a dictionary-based compression scheme depends on the frequency with which the dictionary entries for multiple-character strings are used. If a fixed dictionary is optimized for one file type, it is unlikely to be optimized for another. For example, a dictionary that includes a large number of character combinations likely to be found in newspaper text files is unlikely to efficiently compress database files, spreadsheet files, bit-mapped graphics files, computer-aided design files, et cetera.
Adaptive compression schemes are known in which the dictionary used to compress given input data is created while that input data is being compressed. Codewords representing every single character possible in the uncompressed input data are put into the dictionary. Additional entries are added to the dictionary as multiple-character strings are encountered in the file. The additional dictionary entries are used to encode subsequent occurrences of the multiple-character strings. For example, matching of current input patterns is attempted only against phrases currently residing in the dictionary. After each failed match, a new phrase is added to the dictionary. The new phrase is formed by extending the matched phrase by one symbol (e.g., the input symbol that "breaks" the match). Compression is effected to the extent that the multiple-character strings occurring most frequently in the file are encountered as the dictionary is developing.
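By way of illustration only, the adaptive dictionary construction described above may be sketched as follows. The function name lzw_compress, the use of 8-bit input characters, and the representation of codewords as plain integers (rather than packed bits) are assumptions made for this sketch and are not taken from the cited references.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode bytes as a list of integer codewords (illustrative sketch)."""
    # Seed the dictionary with every possible single-character string.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256          # first code available for a multi-character string
    phrase = b""             # longest match found so far
    codes = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            # The match continues; keep extending the current phrase.
            phrase = candidate
        else:
            # The match is "broken": emit the code for the matched phrase,
            # then add the phrase extended by the breaking symbol.
            codes.append(dictionary[phrase])
            dictionary[candidate] = next_code
            next_code += 1
            phrase = bytes([byte])
    if phrase:
        codes.append(dictionary[phrase])
    return codes


print(lzw_compress(b"ABABABA"))  # [65, 66, 256, 258]
```

In this example the seven input characters are represented by four codewords, the last two of which refer to entries created while the same input was being compressed.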
During decompression, the dictionary is built in a like manner. Thus, when a codeword for a character string is encountered in the compressed file, the dictionary contains the necessary information to reconstruct the corresponding character string. Widely used compression algorithms that use a dictionary to store compression and decompression information are the first and second methods of Lempel and Ziv, called LZ1 and LZ2 respectively. The Lempel-Ziv (LZ) algorithm was originally described by Lempel and Ziv in "On the Complexity of Finite Sequences," IEEE Transactions on Information Theory, IT-22:75-81, January 1976; in "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, IT-23:337-343, May 1977; and in "Compression of Individual Sequences via Variable Rate Coding," IEEE Transactions on Information Theory, IT-24:530-536. Dictionary usage is also disclosed in U.S. Pat. No. 4,464,650 to Eastman et al., and various improvements in the algorithms are disclosed in U.S. Pat. Nos. 4,558,302 to Welch, and 4,814,746 to Miller et al.
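The manner in which a decompressor rebuilds the dictionary while decoding, as described above, may be sketched as follows. This is a minimal illustration assuming 8-bit characters, integer codewords, and an initial dictionary of all 256 single-character strings; the function name lzw_decompress is hypothetical, and the sketch assumes the first codeword denotes a single character.

```python
def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the original bytes from integer codewords (illustrative sketch)."""
    if not codes:
        return b""
    # The decoder starts from the same single-character dictionary as the encoder.
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the entry the encoder created on its most
            # recent step; that entry must be the previous string extended
            # by its own first character.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid codeword {code}")
        output.append(entry)
        # Mirror the encoder: previous phrase plus first symbol of this one.
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


print(lzw_decompress([65, 66, 256, 258]))  # b'ABABABA'
```

No dictionary is transmitted: the decoder derives each new entry from the codewords it has already received.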
In a practical implementation, the amount of memory available for compression/decompression is finite. Therefore, the number of entries in the dictionary is finite and the length of the codewords used to encode the entries is bounded. Typically, the length of codewords varies between 12 and 16 bits. When the input data sequence is sufficiently long, the dictionary will eventually "fill up." Several courses of action are possible at this point. For example, the dictionary can be frozen in its current state and used for the remainder of the input sequence. In a second approach, the dictionary is reset and a new dictionary is created from scratch. In a third approach, the dictionary is frozen for some time, until the compression ratio deteriorates, and then the dictionary is reset. Alternative strategies for dictionary reset are described in U.S. application Ser. No. 07/892,546, filed Jun. 1, 1992, entitled "Lempel-Ziv Compression Scheme with Enhanced Adaptation", which is hereby incorporated by reference herein, and by Bunton, S. et al., in "Practical Dictionary Management for Hardware Data Compression" Communications of the ACM, 5:95-104, January 1992.
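The first two courses of action described above (freezing a full dictionary, or resetting it) may be sketched as follows. The function name lzw_compress_bounded and the parameters max_entries and policy are hypothetical names chosen for this illustration. The third approach, which monitors the compression ratio before resetting, is not shown, and the sketch does not show how a reset would be signaled to, or mirrored by, a decompressor.

```python
def lzw_compress_bounded(data: bytes, max_entries: int = 4096,
                         policy: str = "freeze") -> list[int]:
    """LZW-style encoder with a bounded dictionary (illustrative sketch)."""
    if policy not in ("freeze", "reset"):
        raise ValueError("policy must be 'freeze' or 'reset'")
    if max_entries < 256:
        raise ValueError("max_entries must hold the 256 single characters")

    def fresh_dictionary() -> dict[bytes, int]:
        return {bytes([i]): i for i in range(256)}

    dictionary = fresh_dictionary()
    phrase = b""
    codes = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate
            continue
        codes.append(dictionary[phrase])
        if len(dictionary) < max_entries:
            # Room remains: the next free code equals the current entry count.
            dictionary[candidate] = len(dictionary)
        elif policy == "reset":
            # Dictionary is full: discard it and start again from scratch.
            dictionary = fresh_dictionary()
        # Under "freeze", a full dictionary is simply left unchanged.
        phrase = bytes([byte])
    if phrase:
        codes.append(dictionary[phrase])
    return codes


# A deliberately tiny limit (two multi-character entries) to force a fill-up.
print(lzw_compress_bounded(b"ABABABA", max_entries=258, policy="freeze"))
# [65, 66, 256, 256, 65]
print(lzw_compress_bounded(b"ABABABA", max_entries=258, policy="reset"))
# [65, 66, 256, 65, 66, 65]
```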
In the LZW process, the dictionary must be initialized with the single-character strings that are used to build the compression dictionary. These characters are assigned unique codes within the compression/decompression system. This implies that the number of bits in any additional output code sent out by the encoder (e.g., a code that represents a multiple-character string) is controlled by the number of single-character strings. For example, the shortest bit length for a multiple-character string is determined by the number of single-character strings. The number of bits in subsequent codes representing multiple characters increases by one bit every time the number of entries in the dictionary reaches the next power of 2. Using more bits to represent single-character codewords proportionally decreases the overall compression performance.
The initialization of single input characters as described above is inefficient for input data with a large alphabet size or when only an unknown subset of the alphabet is expected to occur in the input data. For example, when the "natural" alphabet for the input data consists of 16-bit symbols, the initial dictionary would have 65,536 entries. Therefore, any output code generated in addition to the codes for the characters of the "natural" alphabet (e.g., a code representing a multi-character string) is at least 17 bits long. Alternatively, if the block of input data (i.e., the data to be compressed) is small relative to the alphabet size, there is an unnecessarily high overhead in time, memory space, and compression ratio that comes from initializing, storing, and encoding, respectively, single-character strings from the input data.
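The code-length arithmetic in the two preceding paragraphs may be checked with a short calculation. The sketch below assumes that codes are numbered consecutively from 0 and that each code is written with the fewest bits able to represent the largest code assigned so far; the function name code_width is hypothetical, and practical implementations may switch widths at a slightly different point.

```python
def code_width(dictionary_size: int) -> int:
    """Bits needed to write any code in 0 .. dictionary_size - 1."""
    if dictionary_size < 1:
        raise ValueError("dictionary must contain at least one entry")
    # int.bit_length() gives the number of bits in the largest assigned code.
    return max(1, (dictionary_size - 1).bit_length())


# 8-bit alphabet: 256 initial entries fit in 8 bits, but the first
# multi-character entry (code 256) already needs 9 bits.
print(code_width(256), code_width(257))   # 8 9
# The width grows by one bit each time the entry count passes a power of 2.
print(code_width(512), code_width(513))   # 9 10
# 16-bit alphabet: 65,536 initial entries, so the first multi-character
# code (65536) needs 17 bits.
print(code_width(65536 + 1))              # 17
```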
To overcome these problems, some variants of the LZ algorithm employ an empty initial dictionary. When a new input character is encountered, the compressor outputs a special code, followed by a copy of the new character. This allows the decompressor to keep track of the subset of the input alphabet that is actually in use, allowing decoding to proceed as usual. The main problem with this strategy is the high cost of encoding new characters. For short files over large alphabets, this overhead cost might become unacceptably high. For instance, with 8-bit symbols and 12-bit output codes, 20 bits (the 12-bit special code plus the 8-bit character) are required to let the decoder know a new character has occurred. In addition, there is often redundancy within the encoded character strings output by the LZ algorithm. For example, a string of the same input characters (i.e., a "run") produces a sequence of encoded strings with a predictable and redundant structure. This redundancy is not presently leveraged to further increase the compression ratio of standard compression algorithms.
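The cost of introducing new characters under the empty-dictionary strategy may be estimated as follows. This sketch only counts bits; it does not implement any particular variant. It assumes fixed-length output codes and that every distinct character is announced exactly once by a special code followed by the raw character. The function names are hypothetical.

```python
def new_character_cost_bits(symbol_bits: int, code_bits: int) -> int:
    """Bits spent announcing one new character: special code + raw character."""
    return code_bits + symbol_bits


def announcement_overhead_bits(data: bytes, code_bits: int = 12) -> int:
    """Total bits spent announcing every distinct 8-bit character in data."""
    distinct_characters = len(set(data))
    return distinct_characters * new_character_cost_bits(8, code_bits)


print(new_character_cost_bits(symbol_bits=8, code_bits=12))  # 20

# A short input in which every character is different: announcing the
# characters alone costs more than sending the input uncompressed.
sample = bytes(range(50))
print(announcement_overhead_bits(sample), "bits of announcements versus",
      8 * len(sample), "bits uncompressed")
# 1000 bits of announcements versus 400 bits uncompressed
```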
Accordingly, a need remains for a data compression initialization process that is adaptable to different types of input data and different data structures to increase the data compression ratio and to reduce the amount of memory required in a dictionary based compression/decompression system.