The present invention relates to language or text processing. More particularly, the present invention relates to an improved data structure for storing a lexicon and an improved method of using the same.
Language or text processing encompasses many types of systems. For instance, parsers, spell checkers, grammar checkers, word breakers, natural language processing or understanding systems, and machine translation systems are just a few of the types of systems that fall within this broad category.
A common and important component of many language or text processing systems is the lexicon. Generally, the lexicon is a data structure containing information about words. For instance, the lexicon can store indications of syntactic and semantic information, such as whether a word is a noun, verb, adjective, etc. Other types of linguistic information can also be stored in the lexicon. Often it is helpful to store further information useful for the particular type of language processing, such as information about the word that aids in parsing. In yet other lexicons, indications as to whether the word is a proper name, geographical location, etc. can be useful.
In operation, after receiving an input string of words, the language or text processing system accesses the lexicon to obtain the stored information with respect to each of the words. Having gathered the information about each of the words in the input string, the language or text processing system processes the input string, which can include resolving any ambiguities that may exist based on the word information. For instance, in a natural language processing system, the lexicon supplies one or more candidate parts of speech for each of the words in the input string. A syntactic parser then decides which of the part-of-speech assignments are appropriate and builds a structure from the input string, which can then be passed along to a semantic component for interpretation.
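The lookup-then-disambiguate flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `LexiconEntry` fields, the three-word `LEXICON`, and the toy adjacency rule in `disambiguate` are hypothetical and stand in for a real lexicon and a real syntactic parser.

```python
from dataclasses import dataclass


@dataclass
class LexiconEntry:
    """Hypothetical lexicon entry holding a few kinds of word information."""
    word: str
    parts_of_speech: frozenset
    is_proper_name: bool = False


# Tiny illustrative lexicon; a real lexicon holds far more words and fields.
LEXICON = {
    "the": LexiconEntry("the", frozenset({"determiner"})),
    "dog": LexiconEntry("dog", frozenset({"noun", "verb"})),
    "barks": LexiconEntry("barks", frozenset({"noun", "verb"})),
}


def lookup(words):
    """Return (word, sorted candidate parts of speech) for each input word."""
    tagged = []
    for word in words:
        entry = LEXICON.get(word)
        candidates = sorted(entry.parts_of_speech) if entry else []
        tagged.append((word, candidates))
    return tagged


def disambiguate(tagged):
    """Toy stand-in for a parser: choose one candidate using the previous tag."""
    result = []
    previous = None
    for word, candidates in tagged:
        if len(candidates) == 1:
            choice = candidates[0]
        elif previous == "determiner" and "noun" in candidates:
            choice = "noun"
        elif previous == "noun" and "verb" in candidates:
            choice = "verb"
        else:
            choice = candidates[0] if candidates else None
        result.append((word, choice))
        previous = choice
    return result


if __name__ == "__main__":
    print(disambiguate(lookup("the dog barks".split())))
```

Here the lexicon only reports that "dog" and "barks" may each be a noun or a verb; the later stage resolves the ambiguity from context.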
Commonly, each entry in the lexicon comprises a single binary large object. Although the information is accessible, this format does not readily allow localized access to commonly used lexical information without having to read in the complete entry. If all of the information pertaining to a word entry must be read in from the lexicon, more memory and processing time are required, particularly if only a small part of the information for the word entry is needed.
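The cost of the single-object format can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular lexicon: JSON stands in for the binary encoding, and the field names `pos`, `parse_hints`, and `semantics` are hypothetical.

```python
import json


def pack_entry(entry):
    """Serialize a whole entry into one opaque blob (JSON as a stand-in)."""
    return json.dumps(entry, sort_keys=True).encode("utf-8")


def read_part_of_speech(blob):
    """Fetch one small field; the entire blob must be decoded to reach it."""
    entry = json.loads(blob.decode("utf-8"))
    return entry["pos"], len(blob)


# Hypothetical entry: one small, frequently used field and two bulky ones.
entry = {
    "pos": ["noun", "verb"],
    "parse_hints": ["hint-%d" % i for i in range(200)],
    "semantics": {"senses": ["sense-%d" % i for i in range(50)]},
}

blob = pack_entry(entry)
pos, bytes_decoded = read_part_of_speech(blob)

if __name__ == "__main__":
    needed = len(json.dumps(pos))
    print("needed about", needed, "bytes; decoded", bytes_decoded, "bytes")
```

Only the short `pos` list is wanted, yet the number of bytes decoded equals the full size of the entry, which is the overhead the paragraph above describes.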
Modifying or adding to the lexical information is also difficult. Specifically, in order to modify or add further information to the lexicon, the lexicon author must replicate all of the bits, attributes or other information within each entry, then modify the desired information or add to it, while maintaining the integrity and organization of a very complex data structure.
Thus, there is a need for an improved lexicon data structure that addresses one, some or all of the disadvantages discussed above.