The present invention relates to the mapping of words into numbers and of numbers into words. More specifically, the invention relates to techniques for handling a given word within a digital data processing system by mapping that word into a unique number and, when necessary, mapping the number back into the word.
In general, to handle a string of elements such as a word in a digital data processing system, it is frequently desirable to map the word into a number. While words vary greatly in length, a word's corresponding number can ordinarily be expressed digitally as a fixed-length binary number that is shorter than the digital codes for that word's alphabetical characters, so that handling the corresponding binary numbers is far more efficient than handling the words themselves. Another reason for mapping a word into a number is to obtain an address or pointer to access information relevant to that word.
U.S. Pat. No. 4,384,329 describes a technique for accessing synonyms and antonyms in which the first few characters of an input word are used to search an index for an address of a segment of a vocabulary data base containing the input word. That segment is then searched for a matching word with which is stored a word number, which is the row and column corresponding to the input word in a synonym or antonym matrix. The matrix is then accessed to retrieve a row of encoded synonymy information, which is then decoded into column displacements. The displacements are converted into a list of synonym word numbers, and these numbers are decoded into the synonyms themselves, again using the index. This technique thus involves mapping an input word to a number, using that number to retrieve the numbers of its synonyms, and mapping the synonym numbers to the synonymous words. Mapping words to numbers and numbers to words is an important part of this technique.
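The sequence just described (word to number, number to synonym numbers, synonym numbers to words) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from U.S. Pat. No. 4,384,329: the five-word vocabulary, the names `WORDS`, `INDEX`, `ROWS`, `word_to_number` and `synonyms`, the use of only the first letter as the index key, and the encoding of each matrix row as displacements measured from the previous column.

```python
from itertools import accumulate

# Hypothetical sorted vocabulary; a word's position serves as its word number.
WORDS = ["big", "fast", "large", "quick", "rapid"]

# Hypothetical index: first letter -> number of the first word in that segment.
INDEX = {}
for number, word in enumerate(WORDS):
    INDEX.setdefault(word[0], number)

# Hypothetical synonym matrix rows, stored as column displacements.
# Each displacement is relative to the previous column, starting from column 0.
ROWS = {
    0: [2],      # big   -> large
    1: [3, 1],   # fast  -> quick, rapid
    2: [0],      # large -> big
    3: [1, 3],   # quick -> fast, rapid
    4: [1, 2],   # rapid -> fast, quick
}


def word_to_number(word):
    """Use the index to find the segment, then search it for a matching word."""
    if not word:
        return None
    start = INDEX.get(word[0])
    if start is None:
        return None
    for number in range(start, len(WORDS)):
        if WORDS[number][0] != word[0]:
            break  # left the segment without finding a match
        if WORDS[number] == word:
            return number
    return None


def synonyms(word):
    """Map word -> number -> synonym numbers -> synonym words."""
    number = word_to_number(word)
    if number is None:
        return []
    columns = accumulate(ROWS[number])  # displacements -> column numbers
    return [WORDS[column] for column in columns]
```

In this sketch, `synonyms("fast")` passes through both mappings: "fast" becomes word number 1, the stored displacements become column numbers 3 and 4, and those numbers are mapped back into words.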
Published PCT Application WO85/01814, corresponding to U.S. patent application Ser. No. 543,286, discusses data compression techniques in which a dictionary assigns a unique address or token to each word of the text being compressed. A special bit is set in the code for the first character of each word, and these special bits are counted while scanning the dictionary until a given word is found, so that the count of special bits is the token representing that word. A lookup table for accessing the first word beginning with each alphabetic letter also provides the token for each first word, so that the counting can be sped up by beginning with the token from the table. Similarly, a token is converted to a word by decrementing the token for each special bit encountered in scanning the dictionary, and the next word after the token reaches zero is the corresponding word. This conversion can be sped up by comparing the token being converted with the tokens stored in the lookup table, in reverse order, until the first stored token is found which is less than the token being converted. The word represented by the token being converted can then be found by beginning the scan at the first word corresponding to that stored token, since that first word begins with the same letter as the word sought. The dictionary may be reduced in size by replacing the initial characters of a word by a number if they are the same as the initial characters of the preceding word.
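The counting scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of WO85/01814: the special bit is assumed to be the high bit (0x80) of an ASCII code, tokens are assumed to start at 1, and the names `build_dictionary`, `word_at`, `word_to_token` and `token_to_word` are hypothetical. The first-letter lookup table and the replacement of shared initial characters are omitted for brevity.

```python
SPECIAL = 0x80  # assumed marker: high bit set on the first character code of each word


def build_dictionary(words):
    """Pack words into one byte string, flagging each word's first character."""
    out = bytearray()
    for word in words:
        codes = word.encode("ascii")
        out.append(codes[0] | SPECIAL)
        out.extend(codes[1:])
    return bytes(out)


def word_at(data, start):
    """Return the word starting at `start` and the offset of the next word."""
    end = start + 1
    while end < len(data) and not (data[end] & SPECIAL):
        end += 1
    chars = bytes([data[start] & 0x7F]) + data[start + 1:end]
    return chars.decode("ascii"), end


def word_to_token(data, word):
    """Count special bits while scanning; the count at the match is the token."""
    pos, count = 0, 0
    while pos < len(data):
        count += 1  # one special bit per word start
        candidate, pos_next = word_at(data, pos)
        if candidate == word:
            return count
        pos = pos_next
    return None


def token_to_word(data, token):
    """Decrement the token at each special bit; return the word where it hits zero."""
    pos = 0
    while pos < len(data):
        token -= 1
        candidate, pos_next = word_at(data, pos)
        if token == 0:
            return candidate
        pos = pos_next
    return None
```

Both directions scan linearly from the start of the dictionary, which is why a table of starting tokens for each initial letter shortens the scan.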
U.S. Pat. No. 4,597,055 describes a language translator which similarly obtains the serial number of an input word by scanning and counting a special bit, which is set in the first code of each stored word. The stored words and stored sentences are compressed by using codes for combinations of letters. The codes are decoded using a compression table. A sentence may be stored in more than one place, in which case the sentence itself is stored in one place and only the address of that place is stored in the other places.
Ainon, R. N., "Storing Text Using Integer Codes", Proceedings of the 11th International Conference on Computational Linguistics, Bonn, West Germany, August 1986, pp. 418-420, describes a text storage technique to facilitate word manipulation and save storage space. The text is stored as a stream of fixed-length computer words, each a unique integer code for a corresponding word. The word list used in encoding includes groups of words, with the relative position of a member in a group providing syntactic information, and with the syntactic information depending on which of a number of sets includes the group. Each word is stored in a linked list with links to the base word of its group and to the next word in the group and with an identifier of the set which includes that group. The code table used in encoding indicates the two-byte integer codes for words from this word list, each code pointing to the position of the corresponding word in the list. For faster encoding, some words are stored in a hash table which is searched before searching the code table, to save time when encoding common words.
U.S. Pat. No. 4,241,402 describes a finite state automaton (FSA) which can be used to determine whether a received pattern of characters corresponds to one of a number of desired patterns. If so, a state in the FSA is reached which contains a report code identifying the desired pattern which has been matched, and that code, which could be a unique number, is returned. This means that each branch of the FSA contains one or more unique report codes, precluding the collapse of otherwise identical branches into a single branch. In addition, mapping a number back to a word would require searching the FSA until the number was found, a time-consuming process.
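Both drawbacks can be seen in a small sketch. The code below is a simple character trie with a report code stored in each accepting state. It is an illustrative assumption, not the automaton of U.S. Pat. No. 4,241,402, and the names `build_fsa`, `match` and `code_to_pattern` are hypothetical.

```python
def build_fsa(patterns):
    """Build a trie-shaped automaton; each accepting state stores a report code."""
    states = [{"next": {}, "report": None}]  # state 0 is the start state
    for code, pattern in enumerate(patterns):
        s = 0
        for ch in pattern:
            if ch not in states[s]["next"]:
                states.append({"next": {}, "report": None})
                states[s]["next"][ch] = len(states) - 1
            s = states[s]["next"][ch]
        states[s]["report"] = code  # unique number identifying this pattern
    return states


def match(states, text):
    """Follow transitions for `text`; return the report code reached, if any."""
    s = 0
    for ch in text:
        s = states[s]["next"].get(ch)
        if s is None:
            return None
    return states[s]["report"]


def code_to_pattern(states, code):
    """Reverse mapping: search the automaton until the state holding `code` is found."""
    stack = [(0, "")]
    while stack:
        s, prefix = stack.pop()
        if states[s]["report"] == code:
            return prefix
        for ch, target in states[s]["next"].items():
            stack.append((target, prefix + ch))
    return None
```

For the patterns "walk", "walks", "talk" and "talks", the branches after "w" and after "t" accept the same continuations, yet they cannot share states because their accepting states carry different report codes. The reverse mapping has no shortcut either: `code_to_pattern` may visit every state before it finds the requested code.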
U.S. Pat. No. 4,092,729 describes a hyphenation technique which verifies the spelling of an input word by calculating a vector magnitude and angle for the word, the magnitude and angle being used to access a memory. This is an example of mapping a word to a number by hashing, meaning that a computation is performed on the characters of the word to obtain the corresponding number. There are many possible hashing techniques, but hashing generally involves a tradeoff between uniqueness of the corresponding numbers and density of the set of corresponding numbers. The technique of U.S. Pat. No. 4,092,729, for example, is described as producing a unique angle representation, which implies a lack of density, because not all available angle values are used. If density were attained by using all available angle values, then some of the words would probably have the same angle representation, eliminating uniqueness. Uniqueness ensures accurate results and permits number-to-word mapping, but density permits compact storage.
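The tradeoff can be illustrated with a toy hash. The function below is a generic polynomial hash chosen only for illustration and is unrelated to the magnitude and angle computation of U.S. Pat. No. 4,092,729. The word list and the names `toy_hash` and `summarize` are likewise assumptions.

```python
def toy_hash(word, table_size):
    """Illustrative hash: weighted sum of character codes, modulo the table size."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % table_size
    return h


def summarize(words, table_size):
    """Return (unique, density) for the numbers that the hash assigns to `words`."""
    numbers = {toy_hash(word, table_size) for word in words}
    unique = len(numbers) == len(words)   # no two words share a number
    density = len(numbers) / table_size   # fraction of available numbers in use
    return unique, density


WORDS = ["cat", "dog", "bird", "fish", "horse", "mouse", "sheep", "goat"]

# A table with exactly one slot per word: dense, but this hash makes some words collide.
unique_small, density_small = summarize(WORDS, len(WORDS))

# A very large table: every word gets its own number, but almost all numbers go unused.
unique_large, density_large = summarize(WORDS, 2**32)
```

With eight slots for eight words, this particular hash assigns the same number to at least two words, so uniqueness is lost. With 2**32 slots the numbers are unique, but only eight of them are ever used.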
A number of other techniques for mapping words to numbers or numbers to words are known. Published European Patent Application 158,311 describes apparatus for retrieving or searching character strings in which sequential logic provides a class number in response to an input word. Published European Patent Application 168,814 describes a language processing dictionary in which a pointer corresponding to an input expression is accessed in an index file using a B-tree search. U.S. Pat. No. 4,608,665 describes a dictionary in which the address at which an input word is stored is found by addressing a memory until a word which matches the input word is retrieved.
It would be advantageous to have techniques for mapping words to numbers and numbers to words rapidly using a compactly stored word list. It would further be advantageous if such techniques mapped a set of words to a dense set of numbers, each corresponding to only one of the words.