The invention relates generally to computer systems, and more particularly to an improved method and system for storing lexical data and attaching information thereto.
A trie is a data structure that is useful for compressing lexical data such as a list of dictionary words. Tries are composed of states, with a top-level state representing, for example, each of the first letters (e.g., a-z) of all valid words in a given dictionary. Each state is comprised of nodes, wherein each node represents a valid letter in that state, along with some information about that letter, such as a pointer to a lower state (if any). Each state represents a transition from one character in a word to the next. For example, the letter “q” in one state usually transitions to the letter “u” in a next lower state.
To use the trie, such as to find if a user-input word is a valid word in the dictionary, a search through the states is performed. For example, to find the word “the,” the top-level state in the trie is searched until the “t” node is found, and then a next lower level state pointed to by the “t” node is searched to determine if there is an “h” node therein. If not, the word “the” would not be a valid word in that dictionary. However, if there is an “h” node in the state pointed to by the “t” node, the “h” node is examined to find a next state, if any. The state pointed to by the “h” node is then searched to find out whether there is an “e” node therein. If there is an “e” node, then for “the” to be a valid word, the “e” node needs to carry some indication (e.g., a flag) that a valid word ends at this point, regardless of whether the “e” node points to a further state. In a trie-structured dictionary that properly represents a list of words in the English language, “the” would be a valid word, and thus the top-level state would have a “t” node, the next state pointed to by the “t” node would have an “h” node therein, and the state pointed to by that “h” node would have an “e” node therein with a valid flag set. If characters such as “thj” were searched, however, the “t” node would transition to the next state, which would have an “h” node therein, but the next state pointed to by the “h” node would not include a “j” node, and thus this word would not be a valid word.
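The search just described can be illustrated with a minimal sketch. It assumes each state is a Python dictionary mapping a letter to a node, and that each node carries a valid-word flag and an optional pointer to a lower state. The names `Node`, `insert` and `contains` are hypothetical and chosen for illustration only; the sketch does not reflect any particular compressed trie layout.

```python
class Node:
    """One letter within a state (illustrative layout, not a compressed format)."""

    def __init__(self):
        self.valid = False      # flag: a valid word ends at this letter
        self.next_state = None  # pointer to a lower state (dict), if any


def insert(top_state, word):
    """Add a non-empty word, creating states and nodes as needed."""
    if not word:
        raise ValueError("word must be non-empty")
    state = top_state
    for i, ch in enumerate(word):
        node = state.setdefault(ch, Node())
        if i == len(word) - 1:
            node.valid = True
        else:
            if node.next_state is None:
                node.next_state = {}
            state = node.next_state


def contains(top_state, word):
    """Walk one state per character; the last node must have its flag set."""
    state = top_state
    node = None
    for ch in word:
        if state is None or ch not in state:
            return False        # no such letter in this state
        node = state[ch]
        state = node.next_state
    return node is not None and node.valid


top = {}
for w in ["the", "then", "quit"]:
    insert(top, w)

print(contains(top, "the"))   # "t" -> "h" -> "e" with the valid flag set
print(contains(top, "thj"))   # the state under "h" has no "j" node
print(contains(top, "th"))    # the "h" node exists but is not flagged
```

Note that “th” is rejected even though every letter is found, because the flag on the last node, not the mere existence of the path, decides validity.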
While storing words in a trie structure is efficient in terms of both storage and access time, it is difficult to attach information to individual words in the trie. One known way to attach information to certain individual words stored in a trie is to tag selected words by setting a single “tag” bit in the last node of each selected word. Tagging is useful for identifying a small or regular subset of words for special processing upon decompression. For example, some words are slang words, which although acceptable (e.g., to a spell checker), are not recommended (e.g., by a thesaurus). If a trie is used to store words, the slang words can be tagged, whereby upon decompression, those words stand out from the rest. Then, the spell checker may ignore the tag, while the thesaurus may recognize the tag and thereby delete or change the appearance of the word in a list of synonyms presented to a user.
Another technique for associating information with words is known as global enumeration. Global enumeration is a technique that maps each word in the word list to a number and maps that number back to the same word, i.e., the number may be used to determine its associated word, and vice-versa. The numbers are dense, e.g., if there are N words in the list, then the words map to the range zero to N minus one. The number may serve as an index to information associated with specific words, which is useful if the same type of information is attached to every word (or most words) in the list with little or no pattern. For example, the words in a thesaurus may be stored in a trie and enumerated, whereby the number associated with each word may serve as an index to a table of synonyms, a table of antonyms and so on. The tables themselves may be lists of numbers representing associated words that map back to the trie. By way of example, the user may want a synonym for a word that is enumerated in the trie as 957, whereby 957 is used as an index to a table of synonyms, resulting in the numbers 2040, 902 and 457 being retrieved. Those retrieved values are then used to find their corresponding words in the trie for display to a user.
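The two-way mapping and the table lookup can be sketched as follows. To keep the example short, a sorted list stands in for the trie, so this is not how a trie computes the numbers; the word list, the synonym table and the function names (`word_to_number`, `number_to_word`, `synonyms_of`) are all assumptions made for illustration.

```python
from bisect import bisect_left

# A sorted word list stands in for the trie here (an assumption for brevity).
# With N words, the words map densely onto the numbers 0 .. N-1.
WORDS = sorted(["big", "huge", "large", "small", "tiny"])


def word_to_number(word):
    """Map a word to its number, or None if the word is not in the list."""
    i = bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return i
    return None


def number_to_word(number):
    """Map a number back to its word."""
    return WORDS[number]


# The synonym table holds only numbers, which map back to words.
SYNONYMS = {
    word_to_number("big"): [word_to_number("huge"), word_to_number("large")],
    word_to_number("small"): [word_to_number("tiny")],
}


def synonyms_of(word):
    """Word -> number -> table of numbers -> words."""
    number = word_to_number(word)
    return [number_to_word(n) for n in SYNONYMS.get(number, [])]


print(synonyms_of("big"))
```

Because the table stores numbers rather than strings, each associated word costs only an index, at the price of one extra number-to-word lookup per result.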
While tagging and enumeration are thus helpful techniques, they are essentially limited to solving only their specific types of problems, i.e., marking certain words, or associating each of the words in a trie with a unique indexing number. Thus, these solutions work in certain circumstances; however, there are many word lists that would benefit from having additional information stored with the word, and the existing techniques are neither flexible nor extensible enough to solve the problem in an efficient manner. For example, certain languages have gender associated with certain words, but not all words. Thus, a single bit is not sufficient to represent male, female or gender neutral. Separately tagging more than one subset of words can be done by setting aside an additional bit in each node for each additional subset (e.g., one bit for gender or not, one bit for male or female); however, reserving such tagging bits in each node reduces compression. While enumeration could be used to store the related gender information in an indexed table, enumeration requires the storing of numbers with the nodes, which in some instances is very inefficient, such as if enumeration is not otherwise needed and only a few words need such associated information.
Briefly, the present invention provides a method, system and accompanying data structure for the improved attaching of additional information onto words in a trie. The present invention is generally accomplished by providing a framework within the trie data structure capable of storing multiple tags with individual words, wherein some or all of the tags may further have associated values, and/or by separately enumerating some or all of the subsets of tagged words (partial enumeration) in the trie, independent of whether global enumeration of all words is in use. To accomplish multiple tagging, the single tag bit on the last node of a word may be interpreted in a new way, as specified by information placed in a header of the trie. If set, it indicates that a further block of bits (e.g., a byte) is included in the node, which comprises a bitmask specifying which of a plurality of tags are set on that particular node. Header information may also specify which (if any) of the tags have associated values, which are then stored in association with each node having such a tag.
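One way to picture the multiple-tag scheme is the sketch below. It assumes a header that lists the tag names in bit order and names the tags that carry values; the header layout, the tag names, the gender codes and the functions `encode_tags` and `decode_tags` are hypothetical illustrations, not a format taken from this description.

```python
# Hypothetical header: tag names in bit order, plus the tags that carry values.
HEADER = {
    "tags": ["slang", "gender"],   # bit 0 = slang, bit 1 = gender
    "has_value": {"gender"},
}
GENDER = {"male": 0, "female": 1, "neutral": 2}   # illustrative value codes


def encode_tags(header, tags):
    """Return (tag_bit, mask, values) for the last node of a word.

    `tags` maps a tag name to its value, or to None for a value-less tag.
    """
    mask = 0
    values = []
    for bit, name in enumerate(header["tags"]):
        if name in tags:
            mask |= 1 << bit
            if name in header["has_value"]:
                values.append(tags[name])
    tag_bit = 1 if mask else 0   # the mask block is present only if this is set
    return tag_bit, mask, values


def decode_tags(header, tag_bit, mask, values):
    """Recover the tag dictionary from what is stored with the node."""
    if not tag_bit:
        return {}                # untagged word: no mask block to read
    result = {}
    stored = iter(values)
    for bit, name in enumerate(header["tags"]):
        if mask & (1 << bit):
            result[name] = next(stored) if name in header["has_value"] else None
    return result


encoded = encode_tags(HEADER, {"slang": None, "gender": GENDER["female"]})
print(encoded)
print(decode_tags(HEADER, *encoded))
```

Under these assumptions an untagged word pays only for the single tag bit, while the mask and any values are stored solely on the nodes that need them.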
Partial enumeration of tagged items is provided by storing a count of the tagged words under a node. Multiple tags may be selectively and separately enumerated. Header information indicates how the enumeration is arranged, e.g., which of the plurality of tags are enumerated. Partial enumeration may be combined with global enumeration, with multiple tags, and/or with tags that have values, providing a flexible, extensible and efficient way to attach information to words in a trie.
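As a rough illustration of partial enumeration, the sketch below keeps, in every node, a count of the tagged words that end at or below that node, and uses those counts to number only the tagged words. It handles a single tag and assumes alphabetical ordering with a prefix word sorting before its extensions; the node layout and the names `Node`, `insert` and `tagged_index` are assumptions made for illustration.

```python
class Node:
    """Illustrative node: flags plus a count of tagged words at or below it."""

    def __init__(self):
        self.valid = False
        self.tagged = False
        self.tagged_count = 0
        self.children = {}      # lower state: letter -> Node


def insert(top_state, word, tagged=False):
    """Add a non-empty word and update the tagged counts along its path."""
    if not word:
        raise ValueError("word must be non-empty")
    state = top_state
    path = []
    for ch in word:
        node = state.setdefault(ch, Node())
        path.append(node)
        state = node.children
    node.valid = True
    if tagged and not node.tagged:
        node.tagged = True
        for n in path:
            n.tagged_count += 1


def tagged_index(top_state, word):
    """Return the dense index of a tagged word among tagged words, else None."""
    state = top_state
    index = 0
    for i, ch in enumerate(word):
        if ch not in state:
            return None
        # Skip every tagged word stored under an earlier letter of this state.
        index += sum(n.tagged_count for letter, n in state.items() if letter < ch)
        node = state[ch]
        if i == len(word) - 1:
            return index if node.tagged else None
        if node.tagged:
            index += 1          # a tagged prefix word sorts before the target
        state = node.children
    return None


top = {}
for w, is_slang in [("cat", False), ("cool", True), ("dog", False),
                    ("dud", True), ("dude", True), ("yo", True)]:
    insert(top, w, tagged=is_slang)

print([tagged_index(top, w) for w in ["cool", "dud", "dude", "yo"]])
print(tagged_index(top, "dog"))
```

The untagged words receive no number at all, so a table indexed by these values needs one entry per tagged word rather than one per word in the list.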