Dictionary-based techniques for compressing textual data typically employ keyword dictionaries, which identify static words or static phrases by identifiers, such as small codes. Compressing textual data using such a dictionary includes replacing each static word or static phrase of the dictionary that is present within the textual data with its corresponding identifier. The dictionary is stored with the compressed textual data so that it can be used to decompress the textual data when needed.
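By way of illustration only, the replacement and restoration described above can be sketched as follows. The dictionary contents, the function names `compress` and `decompress`, and the representation of compressed data as a list mixing integer codes with leftover text are all assumptions made for this sketch; they are not taken from any particular prior art system.

```python
import re

# Hypothetical keyword dictionary: static word or phrase -> small integer code.
DICTIONARY = {"complete compression": 1, "compression": 2, "dictionary": 3}

def compress(text, dictionary):
    """Replace each dictionary entry found in text with its integer code."""
    # Longer entries are tried first so a phrase wins over a word it contains.
    entries = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(e) for e in entries) + ")")
    tokens = []
    # re.split with a capturing group keeps the matched entries in the result.
    for piece in pattern.split(text):
        if piece in dictionary:
            tokens.append(dictionary[piece])  # static entry -> identifier
        elif piece:
            tokens.append(piece)              # text not covered by the dictionary
    return tokens

def decompress(tokens, dictionary):
    """Restore the original text; the same dictionary must be available."""
    reverse = {code: entry for entry, code in dictionary.items()}
    return "".join(reverse[t] if isinstance(t, int) else t for t in tokens)

text = "complete compression uses a dictionary"
tokens = compress(text, DICTIONARY)
print(tokens)  # [1, ' uses a ', 3]
print(decompress(tokens, DICTIONARY) == text)  # True
```

Because `decompress` needs the reverse mapping, the sketch also shows why the dictionary must accompany the compressed data.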
Such prior art dictionary-based compression techniques achieve compression, but typically with a large degree of redundancy within the dictionary itself. For example, the textual data may contain occurrences of the word “compression” and of the phrase “complete compression.” A dictionary may store one key for the word “compression” and another key for the phrase “complete compression,” or it may store one key for the word “compression” and another key for the word “complete.” In the former instance, the dictionary redundantly stores the word “compression” twice: a first time as its own key, and a second time as part of the phrase “complete compression.” In the latter instance, replacing the phrase “complete compression” within textual data to be compressed involves using two identifiers, one for the word “complete” and another for the word “compression,” instead of using a single identifier as in the former instance.
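The trade-off between the two dictionary layouts can be made concrete with a short sketch. The names `phrase_level`, `word_level`, `times_stored`, and `identifiers_needed` are hypothetical and introduced only for this illustration, which assumes that a phrase absent from the dictionary is encoded word by word.

```python
# Former instance: the word and the phrase each have their own key.
phrase_level = {"compression": 1, "complete compression": 2}
# Latter instance: each word has its own key.
word_level = {"compression": 1, "complete": 2}

def times_stored(word, dictionary):
    """Count how many dictionary entries contain the given word."""
    return sum(word in entry.split() for entry in dictionary)

def identifiers_needed(phrase, dictionary):
    """One identifier if the phrase is an entry, else one per word that is an entry."""
    if phrase in dictionary:
        return 1
    return sum(1 for word in phrase.split() if word in dictionary)

print(times_stored("compression", phrase_level))                  # 2
print(times_stored("compression", word_level))                    # 1
print(identifiers_needed("complete compression", phrase_level))   # 1
print(identifiers_needed("complete compression", word_level))     # 2
```

Under these assumptions, the phrase-level layout pays with a duplicated word in the dictionary, while the word-level layout pays with an extra identifier at each occurrence of the phrase.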
Furthermore, existing dictionary-based compression techniques are unable to efficiently compress dynamic patterns within textual data. For instance, a first phrase within the textual data may be “My friend Harish does a good job,” and a second phrase within the textual data may be “My friend Sateesh does a great job.” The pattern for these two phrases is “My friend [1] does a [2] job,” where the words identified by “[1]” and “[2]” differ between the two phrases. Existing dictionary-based compression techniques merely substitute keys for the words and phrases “My friend,” “does a,” “good,” “great,” and “job” within each of these phrases. This requires five dictionary entries and four identifiers per phrase, while the word “Harish” or “Sateesh” remains uncompressed, which is a simplistic and non-maximal compression of the phrases.
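A minimal sketch of this static behavior, applied to the two example phrases, is given below. The greedy longest-match strategy over whole words, the function name `encode`, and the integer codes are assumptions chosen for illustration; existing techniques may select matches differently.

```python
# Hypothetical static dictionary covering the fixed parts of both phrases.
STATIC_DICTIONARY = {"My friend": 1, "does a": 2, "good": 3, "great": 4, "job": 5}

def encode(phrase, dictionary):
    """Greedy longest-match over whole words; unmatched words stay as text."""
    words = phrase.split()
    longest = max(len(key.split()) for key in dictionary)
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate starting at word i, then shorter ones.
        for n in range(min(longest, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in dictionary:
                out.append(dictionary[candidate])
                i += n
                break
        else:
            out.append(words[i])  # no entry matches: the word is left uncompressed
            i += 1
    return out

print(encode("My friend Harish does a good job", STATIC_DICTIONARY))
# [1, 'Harish', 2, 3, 5]
print(encode("My friend Sateesh does a great job", STATIC_DICTIONARY))
# [1, 'Sateesh', 2, 4, 5]
```

The two outputs differ only in the name and in one code, yet each phrase is still encoded piece by piece, because nothing in a static dictionary captures the shared pattern “My friend [1] does a [2] job” as a single unit.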
These and other shortcomings of the prior art are addressed by the present invention.