A trie is a data structure that is useful for compressing lexical data such as dictionary words. Tries are composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state comprises nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each state represents a transition from one character in a word to the next. For example, the letter "q" in one state usually transitions to the letter "u" in the next lower state.
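The state-and-node layout described above can be sketched as follows. This is only an illustrative sketch in Python, not a layout taken from any particular implementation; the names `Node`, `State`, `is_word`, `next_state` and `insert` are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One valid letter within a state (hypothetical layout)."""
    letter: str
    is_word: bool = False                  # valid-word flag for the path ending here
    next_state: Optional["State"] = None   # pointer to a lower state, if any


@dataclass
class State:
    """A state groups the nodes (letters) that are valid at one position."""
    nodes: Dict[str, Node] = field(default_factory=dict)


def insert(root: State, word: str) -> None:
    """Add a word, creating nodes and lower states only where they are missing."""
    state = root
    for i, ch in enumerate(word):
        node = state.nodes.setdefault(ch, Node(ch))
        if i == len(word) - 1:
            node.is_word = True            # the last letter marks a valid word
        else:
            if node.next_state is None:
                node.next_state = State()
            state = node.next_state


root = State()
for w in ("quit", "queen", "the"):
    insert(root, w)

# The "q" node of the top-level state points to a state holding only "u".
print(sorted(root.nodes))                        # ['q', 't']
print(list(root.nodes["q"].next_state.nodes))    # ['u']
```

In this sketch a node carries both the valid-word flag and the pointer to the lower state, so a word such as "the" can be valid whether or not longer words continue from it.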
To use the trie, such as to find if a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word "the," the top-level state in the trie is searched until the "t" node is found, and then a next lower level state pointed to by the "t" node is searched to determine if there is an "h" node therein. If not, the word "the" would not be a valid word in that dictionary. However, if there is an "h" node in the state pointed to by the "t" node, the "h" node is examined to find a next state, if any. The state pointed to by the "h" node is then searched to find out whether there is an "e" node therein. If there is an "e" node, to be a valid word, the "e" node needs to carry some indication (e.g., a flag) that a valid word exists at this point, regardless of whether the "e" node points to a further state. In a trie-structured dictionary that properly represents the English language, "the" would be a valid word, and thus the top-level state would have a "t" node, the next state pointed to by the "t" node would have an "h" node therein, and the state pointed to by that "h" node would have an "e" node therein with a valid flag set. If characters such as "thj" were searched, however, the "t" node would transition to the next state, which would have an "h" node therein, but the next state pointed to by the "h" node would not include a "j" node, and thus this word would not be a valid word.
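The search just described can be sketched as follows. This is a minimal Python illustration under assumed names (`Node`, `insert`, `is_valid_word`); here each state is modeled as a plain dictionary from letter to node, which is one of several possible representations.

```python
class Node:
    """A letter's bookkeeping: valid-word flag and pointer to a lower state."""
    def __init__(self):
        self.is_word = False
        self.next_state = None   # dict mapping letter -> Node, or None


def insert(root, word):
    """Add a word to the trie whose top-level state is the dict `root`."""
    state = root
    for i, ch in enumerate(word):
        node = state.setdefault(ch, Node())
        if i == len(word) - 1:
            node.is_word = True
        else:
            if node.next_state is None:
                node.next_state = {}
            state = node.next_state


def is_valid_word(root, word):
    """Follow one node per letter; succeed only if the last node is flagged."""
    state = root
    node = None
    for ch in word:
        if state is None or ch not in state:
            return False         # no such letter in this state (e.g. "j" after "th")
        node = state[ch]
        state = node.next_state  # may be None if no longer words continue
    return node is not None and node.is_word


root = {}
for w in ("the", "then", "to"):
    insert(root, w)

print(is_valid_word(root, "the"))   # True: the "e" node has its flag set
print(is_valid_word(root, "thj"))   # False: no "j" node after "th"
print(is_valid_word(root, "th"))    # False: the "h" node exists but is not flagged
```

Note that "the" is accepted even though its "e" node also points to a further state (for "then"), matching the point that validity depends on the flag rather than on whether a lower state exists.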
Tries are used in many types of devices, including those wherein storage space is at a premium. To save space, tries are compressed by using known compression techniques, including those that attempt to efficiently store the information in the trie. Previous compression techniques exploited similarities in the prefixes and in the suffixes of words, approaches known as head merging and tail merging, respectively. In head merging, for example, all words in a trie that begin with "ja" share the "j" of the top-level state, which points to a next level state with a single "a" node therein. In tail merging, for example, all words that end with an "s" essentially end with the same information, i.e., an "s" node that is marked as terminal, and thus may share a single "s" terminal state.
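Head merging and tail merging can be illustrated with the following Python sketch. It is not the implementation of any particular prior technique: head merging falls out of ordinary trie insertion, and tail merging is approximated here by giving every state a signature and keeping one shared copy per signature. The names `State`, `insert`, `merge_tails` and `count_states` are assumptions made for this example.

```python
class State:
    def __init__(self):
        # letter -> [is_word, next_state or None]
        self.nodes = {}


def insert(root, word):
    """Ordinary insertion; common prefixes are shared (head merging)."""
    state = root
    for i, ch in enumerate(word):
        entry = state.nodes.setdefault(ch, [False, None])
        if i == len(word) - 1:
            entry[0] = True
        else:
            if entry[1] is None:
                entry[1] = State()
            state = entry[1]


def merge_tails(state, registry):
    """Bottom-up: replace every completely identical subtree by one shared state."""
    for entry in state.nodes.values():
        if entry[1] is not None:
            entry[1] = merge_tails(entry[1], registry)
    # Children are already canonical, so their identity can stand for the subtree.
    signature = tuple(sorted((ch, e[0], id(e[1])) for ch, e in state.nodes.items()))
    return registry.setdefault(signature, state)


def count_states(state, seen=None):
    """Count distinct states reachable from `state`."""
    seen = set() if seen is None else seen
    if id(state) in seen:
        return 0
    seen.add(id(state))
    return 1 + sum(count_states(e[1], seen)
                   for e in state.nodes.values() if e[1] is not None)


root = State()
for w in ("jam", "jar", "cats", "hats", "dogs"):
    insert(root, w)

before = count_states(root)
root = merge_tails(root, {})
after = count_states(root)
print(before, after)   # 12 8
```

In this small word list, "jam" and "jar" share the "j" and "a" nodes from the start, while merging lets "cats" and "hats" share their whole "ats" subtree and lets "dogs" share the terminal "s" state with them.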
While tail merging saves a significant amount of space, tail merging is limited in that only completely identical subtrees in the trie may be merged. In other words, tail merging cannot be used where subtrees are only partially the same. This limits its usefulness as a compression technique, particularly in languages such as English wherein there are many exceptions to the way words are spelled. For example, in a (limited) dictionary the words "be't'" and "we't'" may share the same endings (suffixes) of "s'," "ter'" and "ting'," where the apostrophe (') represents a valid word flag. However, if "be't'" has a further suffix of "tor'" that is not shared by "we't'," only the "r'" and the "ng'" endings may be merged via tail merging. In sum, even though the subtrees are nearly identical, only the parts thereof that are actually identical may be shared in tail merging.
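The limitation can be checked on the example above with a short Python sketch. It assumes that two states may be tail-merged only when their entire subtrees are structurally equal; the helper names `build`, `state_after` and `canonical` are hypothetical and exist only for this illustration.

```python
def build(words):
    """Build a trie as nested dicts: letter -> [is_word, lower_state_dict]."""
    root = {}
    for word in words:
        state = root
        for i, ch in enumerate(word):
            entry = state.setdefault(ch, [False, {}])
            if i == len(word) - 1:
                entry[0] = True
            state = entry[1]
    return root


def state_after(root, prefix):
    """Return the state reached after consuming `prefix`."""
    state = root
    for ch in prefix:
        state = state[ch][1]
    return state


def canonical(state):
    """Hashable form of a whole subtree; equal forms mean the states could be merged."""
    return tuple(sorted((ch, flag, canonical(child))
                        for ch, (flag, child) in state.items()))


words = ["be", "bet", "bets", "better", "betting", "bettor",
         "we", "wet", "wets", "wetter", "wetting"]
root = build(words)

# The subtrees below "bet" and "wet" differ only by "tor'", so they cannot be shared.
print(canonical(state_after(root, "bet")) == canonical(state_after(root, "wet")))      # False
# The "r'" and "ng'" endings are completely identical and can be shared.
print(canonical(state_after(root, "bette")) == canonical(state_after(root, "wette")))  # True
print(canonical(state_after(root, "betti")) == canonical(state_after(root, "wetti")))  # True
```

Under this assumption, the single extra "o" node in the state after "bett" is enough to make every state above it differ, which is why only the lowest endings remain mergeable.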