1. Field of the Invention
This invention relates to text processing and more particularly to methods for reducing the storage required to store a dictionary of text words.
2. Description of the Prior Art
In implementing a practical automatic spelling verification or hyphenation system, the ultimate number of entries in the dictionary storage file is a very important factor which affects the practicality of the system in terms of both cost and efficiency of operation. Procedures have evolved in the prior art for reducing the size of storage files. One of the procedures requires that entries in the storage file that are not being used, or that are used least, be eliminated.
A second procedure for reducing file size requires compacting the data in the file by run-length encoding, in which each string of identical characters or symbols is replaced with an identifier and a count.
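By way of illustration only, run-length encoding may be sketched as follows. The function names `rle_encode` and `rle_decode`, and the choice of a (character, count) pair as the stored unit, are assumptions made for this example and are not taken from any particular prior art system.

```python
def rle_encode(text):
    """Collapse each run of identical characters into a (character, count) pair."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            # Same character as the current run: extend the count.
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            # A different character starts a new run.
            runs.append((ch, 1))
    return runs


def rle_decode(runs):
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)
```

A saving results only where runs are long. A string with no repeated adjacent characters produces one pair per character and therefore grows rather than shrinks.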
A third procedure for reducing file size requires substituting more efficient codes for the codes ordinarily used, for example, substituting Huffman codes for fixed byte codes so that characters that occur more frequently are represented by fewer bits than those which occur infrequently. This technique can be enhanced by grouping together various sets of characters having similar occurrence properties.
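By way of illustration only, the construction of a Huffman code from character frequencies may be sketched as follows. The function name `huffman_code` and the representation of each code as a string of "0" and "1" characters are assumptions made for this example; a practical system would pack the bits and would derive the frequencies from the whole dictionary file.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each character in text to a prefix-free bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (count, tie_breaker, partial code table).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(count, i, {ch: ""})
            for i, (ch, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes
        # with 0 and 1 respectively.
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]
```

In this sketch the least frequent characters are merged first and so end up deepest in the code tree, which is why frequent characters receive the shorter codes.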
A fourth procedure for reducing the storage file size requires substituting a vector representation of a magnitude and unique angle for each word in the dictionary file where common magnitudes are run-length encoded with the corresponding angles. The theory behind the file size reduction of this technique is that less storage space is required to store the magnitude and angle representation of a word than is required to store the characters of the average length word. The run-length encoding of the magnitudes provides an enhancement to this technique.
However, a major element of dictionary content which causes a word list file to balloon in size is the multiplicity of forms a root word may take on due to prefixes and suffixes. This problem has not been addressed in the prior art.