The present invention relates to natural language or text processing. More particularly, the present invention relates to an improved data structure for storing a lexicon and methods of constructing and using the same.
Natural language or text processing encompasses many types of systems or applications: word breaking such as for search engines, grammar and spell checking, handwriting and speech recognition, machine translation, text mining, and the like. A common and important component of many natural language processing systems and applications is one or more lexicons.
Generally, a lexicon is a data structure containing information about words, which fall into different types. Word types include base words (or “lemmas”), inflections, and derivatives. A lemma is generally the simplest form of a word, such as “jump,” from which other types of words are inflected or derived. A lemma differs from a word stem in that a lemma is a complete word, whereas a word stem is not necessarily so.
Inflections are alternate or inflected forms of a word, typically the lemma, that add affixes (prefixes or suffixes) or otherwise reflect grammatical features such as number, person, mood, or tense. Hence, “jumps,” “jumping,” and “jumped” are inflections of the lemma “jump.” Derivatives are words that are formed from another word by derivation. Thus, “electricity” is a derivative of “electric.”
A lexicon can also contain syntactic and semantic information. Syntactic information relates to syntax rules by which words are combined into grammatically correct phrases or sentences. Thus, syntactic information for a word can include whether the word is a noun, verb, adjective, etc. and can include its relationship to one or more other words in the same sentence, such as a subject-verb or verb-object relationship. In contrast, semantic information conveys meaning. Word meaning can include a definition, gender, number, and whether a word is a named entity such as first name, last name, city name, etc. There is some overlap between syntactic and semantic information. For example, number (singular or plural) and gender both convey meaning and are used in accordance with certain syntax rules.
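By way of illustration only, the kinds of information described above could be grouped into a single lexicon entry as in the following minimal sketch. The class name `LexiconEntry`, its field names, and the dictionary-based lexicon are hypothetical and are not taken from any particular system; they simply show word type, syntactic information, and semantic information stored side by side.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    """Hypothetical lexicon entry; all field names are illustrative assumptions."""
    word: str
    word_type: str                        # "lemma", "inflection", or "derivative"
    base: Optional[str] = None            # word this entry is inflected or derived from
    part_of_speech: Optional[str] = None  # syntactic: noun, verb, adjective, etc.
    number: Optional[str] = None          # overlaps syntax and semantics
    named_entity: Optional[str] = None    # semantic: first name, city name, etc.


# A toy lexicon keyed by surface form, using the examples from the text.
lexicon = {
    "jump": LexiconEntry("jump", "lemma", part_of_speech="verb"),
    "jumps": LexiconEntry("jumps", "inflection", base="jump",
                          part_of_speech="verb", number="singular"),
    "electricity": LexiconEntry("electricity", "derivative", base="electric",
                                part_of_speech="noun"),
}
```

In this sketch, an application would look up a surface form such as "jumps" and read off both its base word and its grammatical features from the same record.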
Additionally, a lexicon can contain information useful for the particular type of language processing. For example, a word and its segmentation can be stored to aid a word breaking application. Other syntactic and/or semantic information can be stored to aid other language processing systems such as querying, grammar checking, or spell checking.
Generally, there is a trade-off between computing speed and the amount and detail of information stored in the lexicon. Thus, for example, in a word breaking application, computing speed increases when the lexicon already stores detailed information on various inflections and derivatives of each encountered lemma. Computing speed decreases when the word breaker must systematically break down a word in a query to generate, for example, lemmas and inflections from a queried word.
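The trade-off can be illustrated with a minimal sketch that contrasts a lexicon storing every inflection explicitly with one that stores only lemmas and breaks words down at query time. The table contents, the suffix list, and the function names `lemma_by_lookup` and `lemma_by_rules` are assumptions made for this example; the naive suffix stripping shown is not a complete morphological analysis.

```python
# Approach 1: the lexicon already stores each inflection with its lemma.
# More storage, but resolving a word is a single lookup.
PRECOMPUTED = {"jumps": "jump", "jumping": "jump", "jumped": "jump"}

# Approach 2: the lexicon stores only lemmas, and words are broken down
# at query time. Less storage, but more work per word.
KNOWN_LEMMAS = {"jump"}
SUFFIXES = ("ing", "ed", "s")  # illustrative, far from exhaustive


def lemma_by_lookup(word):
    """Return the stored lemma for an inflection, or None if not stored."""
    return PRECOMPUTED.get(word)


def lemma_by_rules(word):
    """Strip known suffixes until a stored lemma is found, else None."""
    if word in KNOWN_LEMMAS:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in KNOWN_LEMMAS:
                return candidate
    return None
```

Both functions resolve "jumped" to "jump" in this toy setting; the difference lies in how much is stored versus how much is computed for each word.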
In operation, a natural language processing system can receive an input word or string of words and access stored information in the lexicon to process the word or words according to system parameters. For example, a search or data retrieval engine using an expansive word stemming system can receive a query such as “dogs” and retrieve from a lexicon associated terms stored there (e.g., compounds, lemmas, inflections, derivatives, synonyms, named entities, etc.) such as “hounddog,” “dog,” “dogged,” “Collie,” or “Lassie.” Alternatively, a received query can be input as “dogs,” “dogged,” etc., and the system can access a lexicon to retrieve the lemma “dog.” Such word generation or collapse can be used to broaden (or narrow) a word search depending on system parameters.
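The word generation and collapse described above can be sketched as follows. This is a toy example only: the tables `INFLECTIONS` and `ASSOCIATED`, the derived map `LEMMA_OF`, and the functions `collapse` and `expand` are hypothetical names, and the grouping of the example terms into the two tables is an assumption made for illustration.

```python
# Toy lexicon contents built from the examples in the text.
INFLECTIONS = {"dog": ["dogs", "dogged"]}
ASSOCIATED = {"dog": ["hounddog", "Collie", "Lassie"]}

# Reverse map from each inflected form to its lemma.
LEMMA_OF = {form: lemma
            for lemma, forms in INFLECTIONS.items()
            for form in forms}


def collapse(word):
    """Collapse a query word to its lemma (the word itself if unknown)."""
    return LEMMA_OF.get(word, word)


def expand(word):
    """Generate the terms associated with a query word, excluding the word."""
    lemma = collapse(word)
    terms = [lemma] + INFLECTIONS.get(lemma, []) + ASSOCIATED.get(lemma, [])
    return [term for term in terms if term != word]


print(collapse("dogged"))  # dog
print(expand("dogs"))      # ['dog', 'dogged', 'hounddog', 'Collie', 'Lassie']
```

A search system could use `expand` to broaden a query and `collapse` to narrow it to a single canonical form, depending on system parameters.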
Another system such as a grammar or spell checking system could receive a word string such as “He eat a hptdg” and access information stored in a lexicon to correct the sentence to “He eats a hot dog.” Likewise, systems such as handwriting and speech recognition, machine translation, text mining, and similar systems can access stored information in the lexicon for further processing according to system parameters.
A lexicon that can be used in or adapted to multiple natural language or text processing systems, especially a lexicon that is efficiently stored, easily accessible, and updatable, would have significant utility.