A trie is a data structure that is useful for compressing lexical data such as dictionary words. Tries are composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state is composed of nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each state represents a transition from one character in a word to the next. For example, the letter "q" in one state usually transitions to the letter "u" in a next lower state.
To use the trie, such as to determine whether a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word "the," the top-level state in the trie is searched until the "t" node is found, and then the next lower-level state pointed to by the "t" node is searched to determine whether there is an "h" node therein. If not, the word "the" would not be a valid word in that dictionary. However, if there is an "h" node in the state pointed to by the "t" node, the "h" node is examined to find a next state, if any. The state pointed to by the "h" node is then searched to determine whether there is an "e" node therein. If there is an "e" node, then for "the" to be a valid word, the "e" node needs to carry some indication (e.g., a flag) that a valid word ends at this point, regardless of whether the "e" node points to a further state. In a trie-structured dictionary that properly represents the English language, "the" would be a valid word, and thus the top-level state would have a "t" node, the next state pointed to by the "t" node would have an "h" node therein, and the state pointed to by that "h" node would have an "e" node therein with a valid flag set. If characters such as "thj" were searched, however, the "t" node would transition to the next state, which would have an "h" node therein, but the next state pointed to by the "h" node would not include a "j" node, and thus this string would not be a valid word.
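By way of illustration only, the search described above can be sketched in Python as follows. The names Node, State, insert and is_valid, and the three-word sample dictionary, are assumptions introduced for this sketch and do not appear in any particular implementation. Each state is modeled as a linear list of nodes that is searched from left to right.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One letter within a state, plus information about that letter."""
    char: str
    is_word: bool = False                  # flag: a valid word ends at this node
    next_state: Optional["State"] = None   # pointer to a lower state, if any


@dataclass
class State:
    """A linear sequence of nodes (hypothetical layout for this sketch)."""
    nodes: List[Node] = field(default_factory=list)

    def find(self, char: str) -> Optional[Node]:
        # Linear search through the nodes of this state, left to right.
        for node in self.nodes:
            if node.char == char:
                return node
        return None


def insert(top: State, word: str) -> None:
    """Add a word, creating nodes and lower states as needed."""
    state = top
    for i, ch in enumerate(word):
        node = state.find(ch)
        if node is None:
            node = Node(ch)
            state.nodes.append(node)
        if i == len(word) - 1:
            node.is_word = True            # mark the end of a valid word
        else:
            if node.next_state is None:
                node.next_state = State()
            state = node.next_state


def is_valid(top: State, word: str) -> bool:
    """Follow one node per character; succeed only if the last node is flagged."""
    state: Optional[State] = top
    node: Optional[Node] = None
    for ch in word:
        if state is None:                  # previous node had no lower state
            return False
        node = state.find(ch)
        if node is None:                   # no node for this character
            return False
        state = node.next_state
    return node is not None and node.is_word


# Sample dictionary, assumed for illustration only.
top = State()
for w in ("the", "then", "to"):
    insert(top, w)
```

With this sample dictionary, is_valid(top, "the") succeeds even though the "e" node points to a further state (for "then"), whereas "thj" fails because the state pointed to by the "h" node has no "j" node, and "th" fails because the "h" node carries no valid-word flag.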
In Western scripts such as English, the top-level state is generally from fifty to eighty nodes in length, while states other than the top-level state are usually two to ten nodes in length. To search the top-level state, the nodes initially have been arranged alphabetically and linearly searched from left to right. To speed the search, it is known to reorder the nodes in a state based on lexical frequency, i.e., in the top-level state, the "s" node comes before the "k" node since more words begin with the "s" character than with the "k" character. This speeds the search because, on average, fewer nodes need to be visited before a match is found.
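By way of illustration only, the effect of frequency-based ordering on a linear search can be sketched as follows. The seven-word list and the helper names nodes_visited and average_visits are assumptions made for this sketch, and each word is assumed to be looked up equally often. A top-level state is modeled simply as a list of first letters.

```python
from collections import Counter
from typing import List, Sequence


def nodes_visited(state: Sequence[str], char: str) -> int:
    """Nodes examined by a left-to-right search that ends on char."""
    return state.index(char) + 1


def average_visits(state: Sequence[str], words: Sequence[str]) -> float:
    """Mean nodes visited in the top-level state, one lookup per word."""
    return sum(nodes_visited(state, w[0]) for w in words) / len(words)


# Tiny sample dictionary, assumed for illustration only.
words = ["sat", "sea", "sun", "sky", "kit", "key", "ant"]

# Count how many words begin with each letter.
first_letters = Counter(w[0] for w in words)

# Top-level state in alphabetical order.
alphabetical: List[str] = sorted(first_letters)

# Top-level state with the most frequent first letters placed first;
# ties are broken alphabetically.
by_frequency: List[str] = sorted(
    first_letters, key=lambda c: (-first_letters[c], c)
)

print(alphabetical, average_visits(alphabetical, words))
print(by_frequency, average_visits(by_frequency, words))
```

In this sample, the "s" node moves from the last position to the first, and the average number of nodes visited per lookup falls from 17/7 to 11/7.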
However, in Eastern scripts such as Chinese or Japanese, the top-level state can be over ten thousand nodes in length. Regardless of how the nodes are ordered, the average linear search through so many nodes takes too long to be used in ordinary applications on ordinary computers. Nevertheless, many existing algorithms that operate on tries expect the states to be linear. If the nodes in a state are arranged in some other manner, the existing linear-based algorithms fail. In short, there has heretofore not been an adequate way in which to enumerate a trie for faster searching while preserving the characteristics of linear states for existing linear search algorithms.