A trie is a data structure that is useful for compressing lexical data such as a list of dictionary words. A trie is composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state is composed of nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each pointer from a node to a lower state represents a transition from one character in a word to the next. For example, the letter “q” in one state usually transitions to the letter “u” in a next lower state.
To use the trie, such as to find if a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word “the,” the top-level state in the trie is searched until the “t” node is found, and then a next lower level state pointed to by the “t” node is searched to determine if there is an “h” node therein. If not, the word “the” would not be a valid word in that dictionary. However, if there is an “h” node in the state pointed to by the “t” node, the “h” node is examined to find a next state, if any. The state pointed to by the “h” node is then searched to find out whether there is an “e” node therein. If there is an “e” node, then for “the” to be a valid word, the “e” node needs to carry some indication (e.g., a flag) that a valid word ends at that node, regardless of whether the “e” node points to a further state. In a trie-structured dictionary that properly represents a list of words in the English language, “the” would be a valid word, and thus the top-level state would have a “t” node, the next state pointed to by the “t” node would have an “h” node therein, and the state pointed to by that “h” node would have an “e” node therein with a valid flag set. If characters such as “thj” were searched, however, the “t” node would transition to the next state which would have an “h” node therein, but the next state pointed to by the “h” node would not include a “j” node, and thus this word would not be a valid word.
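The search described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each state is held as a Python dictionary from letter to node; the names Node, is_word, next_state, insert and contains are hypothetical and do not reflect any particular compressed layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """One letter within a state (illustrative layout only)."""
    is_word: bool = False                            # flag: a valid word ends at this letter
    next_state: Optional[Dict[str, "Node"]] = None   # pointer to the lower state, if any


State = Dict[str, Node]


def insert(top: State, word: str) -> None:
    """Add a word, creating lower states only where they are needed."""
    state = top
    for i, ch in enumerate(word):
        node = state.setdefault(ch, Node())
        if i == len(word) - 1:
            node.is_word = True          # mark the last letter as ending a valid word
        else:
            if node.next_state is None:
                node.next_state = {}
            state = node.next_state


def contains(top: State, word: str) -> bool:
    """Follow one transition per letter, then check the valid-word flag."""
    state: Optional[State] = top
    node: Optional[Node] = None
    for ch in word:
        if state is None or ch not in state:
            return False                 # no node for this letter in the current state
        node = state[ch]
        state = node.next_state
    return node is not None and node.is_word


top: State = {}
for w in ("the", "then", "to"):
    insert(top, w)

print(contains(top, "the"))   # True: the "e" node has its flag set
print(contains(top, "thj"))   # False: no "j" node in the state below "h"
print(contains(top, "th"))    # False: the "h" node exists but is not flagged
```

In this sketch the “e” node of “the” is flagged as a valid word and also points to a further state containing “n”, matching the observation that the flag is independent of whether a lower state exists.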
While storing words in a trie structure is efficient in terms of both storage and access time, it is difficult to attach information to individual words in the trie. One known way to attach information to certain individual words stored in a trie is to tag selected words by setting a single “tag” bit in the last node of each selected word. Tagging is useful for identifying a small or regular subset of words for special processing upon decompression. For example, some words are slang words, which although acceptable (e.g., to a spell checker), are not recommended (e.g., by a thesaurus). If a trie is used to store words, the slang words can be tagged, whereby upon decompression, those words stand out from the rest. Then, the spell checker may ignore the tag, while the thesaurus may recognize the tag and thereby delete or change the appearance of the word in a list of synonyms presented to a user.
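The tagging approach can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular product: the trie is simplified to nodes holding a dictionary of children, and the names Node, tagged, insert and walk, as well as the sample words, are hypothetical.

```python
class Node:
    """A letter node with a valid-word flag and a single tag bit."""

    def __init__(self):
        self.children = {}     # letter -> Node in the next lower state
        self.is_word = False   # a valid word ends at this node
        self.tagged = False    # the single "tag" bit (here: marks slang)


def insert(root, word, tagged=False):
    """Store a word and set the tag bit in its last node if requested."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.is_word = True
    node.tagged = tagged


def walk(root, prefix=""):
    """Recover every stored word with its tag bit, in alphabetical order."""
    for ch in sorted(root.children):
        child = root.children[ch]
        if child.is_word:
            yield prefix + ch, child.tagged
        yield from walk(child, prefix + ch)


root = Node()
insert(root, "go")
insert(root, "going")
insert(root, "gonna", tagged=True)   # slang: acceptable, but not recommended

# A spell checker ignores the tag and accepts every word.
spell_checker_words = [word for word, _ in walk(root)]

# A thesaurus recognizes the tag and drops tagged words from its lists.
thesaurus_words = [word for word, tagged in walk(root) if not tagged]

print(spell_checker_words)   # ['go', 'going', 'gonna']
print(thesaurus_words)       # ['go', 'going']
```

The tag costs one bit in every node, whether or not that node ends a tagged word, which is why the technique suits a single small or regular subset.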
Another technique for associating information with words is known as global enumeration. Global enumeration is a technique that maps each word in the word list to a number and maps that number back to the same word, i.e., the number may be used to determine its associated word, and vice-versa. The numbers are dense, e.g., if there are N words in the list, then the words map to the range zero to N minus one. The number may serve as an index to information associated with specific words, which is useful if the same type of information is attached to every word (or most words) in the list with little or no pattern. For example, the words in a thesaurus may be stored in a trie and enumerated, whereby the number associated with each word may serve as an index to a table of synonyms, a table of antonyms and so on. The tables themselves may be lists of numbers representing associated words that map back to the trie. By way of example, the user may want a synonym for a word that is enumerated in the trie as 957, whereby 957 is used as an index to a table of synonyms, resulting in the numbers 2040, 902 and 457 being retrieved. Those retrieved values are then used to find their corresponding words in the trie for display to a user.
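One way to obtain such a dense, two-way mapping is to store with each node the number of words at or below it, so that a word's number equals its alphabetical rank. The sketch below assumes that scheme for illustration only; the text above does not specify how the numbers are stored, and the names Node, count, build, word_to_index, index_to_word and the small synonym table are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}     # letter -> Node in the next lower state
        self.is_word = False   # a valid word ends at this node
        self.count = 0         # number of words ending at or below this node


def build(words):
    """Build a counted trie from distinct, non-empty words."""
    root = Node()
    for word in set(words):
        node = root
        node.count += 1
        for ch in word:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.is_word = True
    return root


def word_to_index(root, word):
    """Return the word's number in 0..N-1, or None if it is not stored."""
    index, node = 0, root
    for ch in word:
        if node.is_word:
            index += 1                        # a shorter word ending here sorts first
        for other, child in node.children.items():
            if other < ch:
                index += child.count          # skip all words under earlier letters
        if ch not in node.children:
            return None
        node = node.children[ch]
    return index if node.is_word else None


def index_to_word(root, index):
    """Return the word for a number in 0..N-1, or None if out of range."""
    if not 0 <= index < root.count:
        return None
    node, letters = root, []
    while True:
        if node.is_word:
            if index == 0:
                return "".join(letters)
            index -= 1
        for ch in sorted(node.children):
            child = node.children[ch]
            if index < child.count:
                letters.append(ch)
                node = child
                break
            index -= child.count


root = build(["big", "huge", "large", "small", "tiny"])

# Hypothetical synonym table indexed by word number; entries are word numbers.
synonyms = [[1, 2], [0, 2], [0, 1], [4], [3]]

number = word_to_index(root, "big")
print(number)                                             # 0
print([index_to_word(root, n) for n in synonyms[number]])  # ['huge', 'large']
```

The per-node count is what makes the mapping possible in both directions, and it is also the storage overhead that enumeration adds to every node.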
While tagging and enumeration are thus helpful techniques, they are essentially limited to solving only their specific types of problems, i.e., marking certain words, or associating each of the words in a trie with a unique indexing number. These solutions work in certain circumstances; however, there are many word lists that would benefit from having additional information stored with the words, and the existing techniques are neither flexible nor extensible enough to solve the problem in an efficient manner. For example, certain languages have gender associated with certain words, but not all words. A single bit is thus not sufficient to represent male, female or gender neutral. Separately tagging more than one subset of words can be done by setting aside an additional bit in each node for each additional subset (e.g., one bit for gender or not, one bit for male or female); however, reserving such tagging bits in each node reduces compression. While enumeration could be used to store the related gender information in an indexed table, enumeration requires the storing of numbers with the nodes, which in some instances is very inefficient, such as if enumeration is not otherwise needed and only a few words need such associated information.