Contemporary data compression technologies tend to rely on one of two tactics: 1) redundancy removal by entropy-optimized re-encoding of a source file, based on self-similarity analysis of the file; or 2) static-dictionary-based substitution coding. A key difference between the two is that the former need not rely on external information being supplied. For example, any dictionary developed in the course of compressing an input file is built dynamically from redundancies in the input data itself. (The logic starts with no a priori assumptions about the file and no pre-existing dictionaries.) In the latter, an externally developed dictionary or codebook is consulted for compression. For example, if the file to be compressed is an English text document, an English dictionary can be used to compress it. The text document is simply re-encoded as a series of offsets into the dictionary. (Of course, if an Arabic dictionary is substituted for the English dictionary, very poor compression can be expected, because the Arabic dictionary is not well suited to the data.) By virtue of avoiding any particular dictionary, the former is more general and thus more commonly used. But in use cases where the latter can be exploited successfully, it tends to provide higher compression ratios.
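The static-dictionary tactic described above can be illustrated with a minimal sketch. The word list, the function names and the choice to keep unknown words as literal strings are all assumptions made for illustration; they are not taken from any particular product or from the approach described here.

```python
# Hypothetical static dictionary, agreed on in advance by encoder and decoder.
STATIC_DICTIONARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
WORD_TO_OFFSET = {word: offset for offset, word in enumerate(STATIC_DICTIONARY)}


def encode(text):
    """Re-encode text as dictionary offsets; unknown words stay as literal strings."""
    return [WORD_TO_OFFSET.get(word, word) for word in text.split()]


def decode(codes):
    """Rebuild the text by looking each offset up in the same static dictionary."""
    return " ".join(
        STATIC_DICTIONARY[code] if isinstance(code, int) else code
        for code in codes
    )


def coverage(codes):
    """Fraction of tokens that were replaced by a dictionary offset."""
    if not codes:
        return 0.0
    return sum(isinstance(code, int) for code in codes) / len(codes)
```

In this sketch, text drawn from the dictionary's own vocabulary is reduced entirely to small integers, while text in another language yields a coverage of zero and gains nothing, which mirrors the mismatched-dictionary case noted above.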
Static-dictionary compression can be very effective. For example, products that fit the contents of a large reference source, such as the Bible, into the limited storage space of a palm-sized device generally achieve this feat by building a static dictionary from a concordance of the text, and using that dictionary to reconstruct the verses. This results in a high level of text reuse, since a common phrase like “thou shalt not” need be stored only once and merely pointed to from then on. Building dictionaries, however, imposes a computational burden on (de)compression. Using existing dictionaries to fit to-be-compressed material presents difficulty not only in finding the best dictionary, but also in making sure it has adequate entries corresponding to the material.
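The phrase-reuse effect can be tried out with the preset-dictionary support in Python's standard zlib module (the `zdict` argument). This is only a sketch under stated assumptions: the sample dictionary and message are made up, and DEFLATE with a preset dictionary stands in for whatever scheme a given product actually uses.

```python
import zlib

# Hypothetical preset dictionary of phrases expected to recur in the material.
PHRASE_DICTIONARY = b"thou shalt not and the LORD said unto Moses"


def compress_with_dictionary(data, dictionary):
    """Compress data, letting matches point back into the preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dictionary(blob, dictionary):
    """Decompress data; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


message = b"and the LORD said unto Moses, thou shalt not steal"
with_dictionary = compress_with_dictionary(message, PHRASE_DICTIONARY)
without_dictionary = zlib.compress(message, 9)
```

For a short message such as this one, the dictionary-assisted output is smaller than the plain output because the common phrases are replaced by references into the dictionary. The decompressor must hold an identical copy of the dictionary, which is the practical cost of the approach.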
In view of these various problems, there is a need in the art of dictionary-based (de)compression to easily find, use and/or build dictionaries to achieve excellent compression ratios. Making sure the dictionary is semantically well-tailored to the material is another need. In the world of computing, there is also a continual need to leverage existing technologies. Any improvements along such lines should further contemplate good engineering practices, such as relative inexpensiveness, stability, ease of implementation, low complexity, flexibility, etc.