1. Field of the Invention
The invention relates to LZ data compression and decompression systems, particularly with respect to preventing the storage of infrequently encountered data character strings in the compressor and decompressor dictionaries.
2. Description of the Prior Art
Professors Abraham Lempel and Jacob Ziv provided the theoretical basis for LZ data compression and decompression systems that are in present day widespread usage. Two of their seminal papers appear in the IEEE Transactions on Information Theory, IT-23-3, May 1977, pp. 337-343 and in the IEEE Transactions on Information Theory, IT-24-5, September 1978, pp. 530-536. A ubiquitously used data compression and decompression system known as LZW, adopted as the standard for V.42 bis modem compression and decompression, is described in U.S. Pat. No. 4,558,302 by Welch, issued Dec. 10, 1985. LZW has been adopted as the compression and decompression standard used in the GIF image communication protocol and is utilized in the TIFF image communication protocol. GIF is a development of CompuServe Incorporated and the name GIF is a Service Mark thereof. A reference to the GIF specification is found in GRAPHICS INTERCHANGE FORMAT, Version 89a, Jul. 31, 1990. TIFF is a development of Aldus Corporation and the name TIFF is a Trademark thereof. Reference to the TIFF specification is found in TIFF, Revision 6.0, Final, Jun. 3, 1992.
Further examples of LZ dictionary based compression and decompression systems are described in the following U.S. patents: U.S. Pat. No. 4,464,650 by Eastman et al., issued Aug. 7, 1984; U.S. Pat. No. 4,814,746 by Miller et al., issued Mar. 21, 1989; U.S. Pat. No. 4,876,541 by Storer, issued Oct. 24, 1989; U.S. Pat. No. 5,153,591 by Clark, issued Oct. 6, 1992; U.S. Pat. No. 5,373,290 by Lempel et al., issued Dec. 13, 1994; U.S. Pat. No. 5,838,264 by Cooper, issued Nov. 17, 1998; U.S. Pat. No. 5,861,827 by Welch et al., issued Jan. 19, 1999; and U.S. Pat. No. 5,951,623 by Reynar et al., issued Sep. 14, 1999.
In the above dictionary based LZ compression and decompression systems, the compressor and decompressor dictionaries may be initialized with all of the single character strings of the character alphabet. In some implementations, the single character strings are considered as recognized although not explicitly stored. In such systems the value of the single character may be utilized as its code and the first available code utilized for multiple character strings would have a value greater than the single character values. In this way the decompressor can distinguish between a single character string and a multiple character string and recover the characters thereof. For example, in the ASCII environment, the alphabet has an 8 bit character size supporting an alphabet of 256 characters. Thus, the characters have values of 0-255. The first available multiple character string code can, for example, be 258 where the codes 256 and 257 are utilized as control codes as is well known.
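The code assignment described in the preceding paragraph can be illustrated with a minimal sketch. The constant and function names are hypothetical, and the roles of the two control codes are deliberately left unspecified because the text only states that codes 256 and 257 are reserved.

```python
ALPHABET_SIZE = 256          # 8-bit characters take the values 0-255
CONTROL_CODES = (256, 257)   # reserved as control codes; roles not specified here
FIRST_STRING_CODE = 258      # first code available for a multiple character string


def classify(code: int) -> str:
    """Tell a single character code from a control or string code."""
    if 0 <= code < ALPHABET_SIZE:
        return "single"
    if code in CONTROL_CODES:
        return "control"
    if code >= FIRST_STRING_CODE:
        return "string"
    raise ValueError(f"invalid code: {code}")


def recover_single(code: int) -> bytes:
    """A single character code is simply the value of the character."""
    if classify(code) != "single":
        raise ValueError("not a single character code")
    return bytes([code])
```

Because the value of a single character doubles as its code, a decompressor following this layout needs no stored entry to recover a single character string.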
In the prior art dictionary based LZ compression and decompression systems, specific methodologies often require that the dictionary be limited to a fixed size. For example, in the GIF protocol, the dictionary is limited to a maximum of 4095 strings with a concomitant maximum code size of 12 bits. When filled to maximum capacity, the dictionary may be frozen and utilized with the extant stored strings to perform further compression until such time as it is desirable to clear the dictionary contents.
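As a rough illustration of a fixed-size dictionary that freezes when full, consider the following sketch. The class name `BoundedDictionary` and its methods are hypothetical, and enforcing the limit through a largest permitted code value (4095 for a 12 bit code) is an assumption about one way such a limit might be implemented.

```python
class BoundedDictionary:
    """String dictionary that stops accepting entries once its codes run out."""

    def __init__(self, first_code: int = 258, max_code: int = 2**12 - 1):
        self.first_code = first_code
        self.max_code = max_code      # largest code that fits in the code size
        self.codes = {}               # stored string -> assigned code
        self.next_code = first_code

    @property
    def frozen(self) -> bool:
        # Frozen when no code remains to assign to a new string.
        return self.next_code > self.max_code

    def add(self, string: bytes) -> bool:
        """Store a string if a code is available; report whether it was stored."""
        if self.frozen or string in self.codes:
            return False
        self.codes[string] = self.next_code
        self.next_code += 1
        return True

    def clear(self) -> None:
        """Discard all stored strings so the dictionary can fill again."""
        self.codes.clear()
        self.next_code = self.first_code
```

While frozen, the extant strings remain available for matching; only the storing of new strings stops until `clear` is called.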
During operation of the LZ methodology, the dictionary fills with data character strings and string fragments, some of which may be only infrequently encountered. Thus, a number of the available codes may be occupied by string fragments that may only rarely, if ever again, be encountered. The codes so occupied would not significantly contribute to the compression of the input data character stream. Since, as discussed above, the dictionary may be limited in size, the number of available codes occupied by these rarely encountered string fragments may be significant, which, it is believed, will have an adverse effect on compression efficiency.
The present inventor believes that excluding infrequently encountered strings from the dictionary would improve compression performance. It is furthermore believed that no method or apparatus exists in the data compression/decompression art for specifically excluding infrequently encountered strings from being stored in the dictionary and occupying valuable dictionary codes.
In said U.S. Pat. No. 5,951,623, a pre-filled dictionary is used in combination with a data specific dictionary where the pre-filled dictionary is pre-loaded with commonly occurring character sequences. The data specific dictionary is utilized in the normal LZ compression mode and the longest match from either the pre-filled dictionary or the data specific dictionary is utilized as the compressed output. Extended strings are stored in the data specific dictionary. It is believed that rarely occurring character sequences may be entered into the data specific dictionary since these rarely occurring sequences have no counterpart in the pre-filled dictionary. Even though the pre-filled dictionary stores frequently occurring sequences, infrequently occurring sequences can still usurp valuable codes from the data specific dictionary as discussed above.
It is an objective of the present invention to provide a data compression and decompression system that prevents infrequently occurring data character strings from occupying valuable codes in the compression and decompression dictionaries.
The system of the present invention includes a data compressor for compressing an input stream of data characters into an output stream of compressed codes. A dictionary stores strings of data characters encountered in the input stream, the stored strings having respective codes associated therewith. The input stream is searched by comparing the input stream to the stored strings to determine the longest match therewith. The code associated with the longest match is output so as to provide the output stream of compressed codes. An exclusion table is included, storing strings of data characters to be excluded from storage in the dictionary. An extended string is formed comprising the longest match extended by the next data character in the input stream. If the extended string is not in the exclusion table, it is stored in the dictionary and a code is assigned thereto. If the extended string is in the exclusion table, it is not stored and the code remains available for another string.
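The compressor behavior summarized above can be sketched as an LZW-style loop with one added check before a string is stored. This is a simplified illustration, not the disclosed implementation: the function name, the modeling of the exclusion table as a set of byte strings, and the omission of a dictionary size limit and control codes are all assumptions. How the exclusion table is populated is not addressed here.

```python
from typing import Dict, List, Set

FIRST_STRING_CODE = 258  # codes 0-255 are single characters; 256-257 reserved


def compress(data: bytes, exclusion_table: Set[bytes]) -> List[int]:
    """LZW-style compression that never stores an excluded extended string."""
    dictionary: Dict[bytes, int] = {}   # multiple character string -> code
    next_code = FIRST_STRING_CODE
    output: List[int] = []
    if not data:
        return output

    def code_of(string: bytes) -> int:
        # A single character is its own code; longer strings are looked up.
        return string[0] if len(string) == 1 else dictionary[string]

    match = data[:1]                    # current longest match
    for i in range(1, len(data)):
        char = data[i:i + 1]
        extended = match + char
        if extended in dictionary:
            match = extended            # keep extending the match
            continue
        output.append(code_of(match))   # mismatch: emit the longest match
        if extended not in exclusion_table:
            dictionary[extended] = next_code
            next_code += 1              # code consumed only when stored
        match = char                    # mismatching character starts next search
    output.append(code_of(match))
    return output
```

For the input `b"abababab"`, excluding `b"ba"` leaves code 259 free, so the string `b"aba"` receives 259 instead of 260.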
Specifically, the input stream is compared to the strings stored in the dictionary until a mismatching input character occurs. In this manner the longest match is determined. The mismatching character is used to begin the next string search unless it is included in the exclusion table. If the mismatching character is included in the exclusion table, it is output, and further input data characters are fetched and output until an input data character is fetched that is not in the exclusion table. This character is then used to begin the next string search.
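The handling of an excluded mismatching character might look like the following sketch. It assumes that excluded single characters can be modeled as a set of character values and that an excluded character is output as its own value; the function name and signature are hypothetical.

```python
from typing import List, Set, Tuple


def bypass_excluded(data: bytes, pos: int,
                    excluded_chars: Set[int]) -> Tuple[List[int], int]:
    """Output excluded characters starting at data[pos].

    Returns the emitted codes and the position of the first character that
    is not excluded, which then begins the next string search.
    """
    emitted: List[int] = []
    while pos < len(data) and data[pos] in excluded_chars:
        emitted.append(data[pos])   # character value doubles as its code
        pos += 1
    return emitted, pos
```

If the character at `pos` is not excluded, nothing is emitted and `pos` is returned unchanged, so the caller can begin the next string search there in either case.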
A data decompressor includes the same exclusion table as included in the compressor. Utilizing the exclusion table, the decompressor excludes the same strings from storage in the decompressor dictionary as are excluded from storage in the compressor dictionary.
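A decompressor that mirrors the exclusion check can be sketched as follows. As before, this is an illustration under assumptions: the exclusion table is modeled as a set of byte strings identical to the one held by the compressor, the names are hypothetical, and control codes and a dictionary size limit are not handled.

```python
from typing import Dict, Iterable, Set

FIRST_STRING_CODE = 258  # codes 0-255 are single characters; 256-257 reserved


def decompress(codes: Iterable[int], exclusion_table: Set[bytes]) -> bytes:
    """LZW-style decompression applying the same exclusion test as the compressor."""
    dictionary: Dict[int, bytes] = {}   # code -> multiple character string
    next_code = FIRST_STRING_CODE
    output = bytearray()
    prev = b""                          # string recovered from the previous code
    for code in codes:
        if 0 <= code < 256:
            entry = bytes([code])       # single character: value is the code
        elif code in dictionary:
            entry = dictionary[code]
        elif code == next_code and prev:
            # Code for a string the compressor stored and used immediately.
            entry = prev + prev[:1]
        else:
            raise ValueError(f"unexpected code: {code}")
        output += entry
        if prev:
            candidate = prev + entry[:1]          # compressor's extended string
            if candidate not in exclusion_table:  # same test, same table
                dictionary[next_code] = candidate
                next_code += 1
        prev = entry
    return bytes(output)
```

Because both sides consult the same table at the same point, the two dictionaries assign identical codes and stay synchronized without any extra signaling in the code stream.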