It is well known that compressing digital signals, i.e., data, saves space and time when the data are stored or communicated. Compressed data can be stored in less memory, and compressed data take less time to travel along communication lines.
One set of commonly used compression techniques is based on the compressor and the de-compressor sharing data in what is commonly known as a dictionary. The dictionary can be fixed or adaptive, as described below. The dictionary is used to translate the data to a compressed form, and the inverse translation is applied to the compressed data to recover the original data. Compression advantages can be gained when the dictionary is sensitive to the content of the data. For example, different dictionaries would probably be used to compress data representing speech and data representing video signals. In general, the more closely the dictionary reflects the underlying data, the better the compression that can be achieved.
In one well-known type of dictionary-based compression, for example, Huffman encoding, a two-pass process is used to produce a content-sensitive dictionary. During the first pass, the compression process makes a partial or complete pass over the data to "learn" the relative frequency of compressible bit patterns. Bit patterns which occur frequently are then assigned short codes, and less frequently occurring patterns are assigned longer codes, or perhaps are not translated at all. During the second pass, the original data are compressed according to the code substitutions defined by the fixed dictionary generated during the first pass. Decompression simply uses the dictionary to perform the inverse translation.
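The two passes described above can be illustrated with a minimal sketch, assuming byte-oriented input and a code table held as strings of "0" and "1" for readability. The function names build_code, encode, and decode are hypothetical and chosen only for this illustration; a practical coder would pack the bits and transmit the table with the message.

```python
import heapq
from collections import Counter
from typing import Dict


def build_code(data: bytes) -> Dict[int, str]:
    """First pass: count byte frequencies and derive a Huffman prefix code."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code beneath them.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(data: bytes, code: Dict[int, str]) -> str:
    """Second pass: substitute each byte with its code from the fixed table."""
    return "".join(code[b] for b in data)


def decode(bits: str, code: Dict[int, str]) -> bytes:
    """Inverse translation using the same table."""
    inverse = {c: s for s, c in code.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

In this sketch the table returned by build_code plays the role of the fixed dictionary: the decoder cannot recover the data without it, which is why a self-contained message has to carry it in some form.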
Another set of dictionary-based substitutional compression schemes is known as Lempel-Ziv (LZ) encoding, including LZ77, LZ78, LZW, etc. There, during a single pass, groups of bits (or characters) are encoded by referring to a previous occurrence of the same group of bits or characters in the data record. In this case, an adaptive dictionary expresses a mapping between indices and previous occurrences of encoded patterns.
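As an illustration of such an adaptive dictionary, the following is a minimal LZ78-style sketch, assuming byte-oriented input and an output of (index, next byte) pairs that are not packed into bits. The names lz78_encode and lz78_decode are hypothetical. Note that the dictionary starts out holding only the empty phrase and is rebuilt identically by the decoder from the pairs themselves.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, Optional[int]]


def lz78_encode(data: bytes) -> List[Pair]:
    """Single pass: emit (index of longest known phrase, next byte) pairs."""
    dictionary: Dict[bytes, int] = {b"": 0}  # index 0 is the empty phrase
    pairs: List[Pair] = []
    phrase = b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            # Keep extending while the phrase has been seen before.
            phrase = candidate
        else:
            pairs.append((dictionary[phrase], byte))
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            phrase = b""
    if phrase:
        # Input ended inside a known phrase; None marks "no following byte".
        pairs.append((dictionary[phrase], None))
    return pairs


def lz78_decode(pairs: List[Pair]) -> bytes:
    """Rebuild the same dictionary from the pairs while decoding."""
    phrases: List[bytes] = [b""]
    out = bytearray()
    for index, byte in pairs:
        entry = phrases[index] + (bytes([byte]) if byte is not None else b"")
        out += entry
        phrases.append(entry)
    return bytes(out)
```

For example, lz78_encode(b"abababababab") yields six pairs for twelve input bytes, with later pairs covering longer phrases than earlier ones as the dictionary fills.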
These types of compression techniques generally produce what is called "self-contained" output. That is, all the receiver needs is some generic implementation of a decompression process and the message itself; no external data are needed. The self-contained property requires that the compressed form of the message include, in some way, the information about the dictionary.
For large messages, the overhead introduced by the dictionary is generally relatively small when compared with the time required to encode and decode, although the dictionary can grow quite large. Various schemes have been proposed for keeping the dictionary within some bounded size.
One place where data can benefit from compression is the World Wide Web (the "Web"). Over recent years, the amount of data stored and communicated via the Web has grown exponentially, particularly taking into consideration Web pages and Web e-mail. One drawback of applying known compression schemes to the Web is that most Web messages are relatively short, about 7K bytes per message.
Because adaptive self-contained schemes start with an empty dictionary, the compression efficiency for these small files is not as good as it would be for much larger files. For example, while a very large file might be compressed by a factor of ten, Web messages might only compress by a factor of two. Actual ratios may vary depending on content and technique used.
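The effect can be observed with a small experiment, sketched here under stated assumptions: the Python standard-library zlib module (DEFLATE, which combines LZ77-style matching with Huffman coding) stands in for an adaptive self-contained scheme, and the input is synthetic text drawn from a hypothetical 200-word vocabulary. The helper names synthetic_text and compression_factor are illustrative only, and the measured factors are not expected to match the figures quoted above.

```python
import random
import zlib


def synthetic_text(n_bytes: int, seed: int = 0) -> bytes:
    """Return n_bytes of space-separated words from a fixed random vocabulary."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = [
        "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
        for _ in range(200)
    ]
    words = []
    size = 0
    while size <= n_bytes:
        word = rng.choice(vocab)
        words.append(word)
        size += len(word) + 1  # +1 for the separating space
    return " ".join(words).encode("ascii")[:n_bytes]


def compression_factor(payload: bytes) -> float:
    """Original size divided by compressed size."""
    return len(payload) / len(zlib.compress(payload, 9))


small = synthetic_text(500)        # a short message
large = synthetic_text(200_000)    # a much larger one from the same source

print("small: %.2f" % compression_factor(small))
print("large: %.2f" % compression_factor(large))
```

Because both inputs are generated with the same seed, the short message is a prefix of the long one, so any difference in the compression factor comes from how much history the compressor has accumulated rather than from a difference in content.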
Therefore, it is desired to provide a dictionary-based compression technique which works efficiently with small files.