Automated natural language (NL) text processing typically refers to text processing, such as text retrieval, performed by a computer capable of "reading" and "understanding" the semantics of the text. Efficient natural language processing systems can be of great benefit in performing tasks such as information retrieval. By understanding the meaning, i.e., the semantics, of the text, the computer can perform a more accurate search and bring only relevant information to the attention of the requestor.
In order to perform such "intelligent" searches, the computer itself must "understand" the text. Natural language processing systems therefore typically contain tools, or software modules, that generate a representation of an understanding of the text. In particular, when text is input to an NL system, the system not only stores the text but also generates a representation, in a computer-understandable format, of the meaning, i.e., the semantics, of the text.
To generate a computer-understandable semantic representation of text, natural language processing systems include, in general and at a high level, a lexicon module and a processing module. The lexicon module is a "dictionary", or database, containing words and semantic knowledge related to each word. The processing module typically includes a plurality of analyzer modules which operate upon the input text and the lexicon module in order to process the text and generate the computer-understandable semantic representation. In particular, the processing module generates a recoded version of each word of text, and the recoded version includes fields which represent semantic knowledge. Once this semantic knowledge of the text is generated in a computer-understandable format, a system user can use the computer, via an application program such as a search program, to perform tasks such as text retrieval.
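The relationship between the lexicon module and the recoded words can be illustrated with a minimal sketch. All names here (`LexiconEntry`, `RecodedWord`, `LEXICON`, `recode`) and the toy lexicon contents are hypothetical and chosen only for illustration; an actual processing module would use several analyzer modules rather than a single lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexiconEntry:
    """One word in the lexicon together with its semantic knowledge."""
    word: str
    part_of_speech: str
    senses: List[str] = field(default_factory=list)


@dataclass
class RecodedWord:
    """Recoded version of one word of input text, with semantic fields."""
    surface: str
    part_of_speech: str
    senses: List[str]


# Hypothetical toy lexicon; a real lexicon would hold many thousands of entries.
LEXICON: Dict[str, LexiconEntry] = {
    "bank": LexiconEntry("bank", "noun", ["financial_institution", "river_edge"]),
    "loan": LexiconEntry("loan", "noun", ["money_lent"]),
}


def recode(text: str) -> List[RecodedWord]:
    """Look up each word of the text in the lexicon and emit its recoded form."""
    recoded = []
    for raw in text.lower().split():
        token = raw.strip('.,;:!?"')
        if not token:
            continue  # the token was punctuation only
        entry = LEXICON.get(token)
        if entry is None:
            # Word not in the lexicon: no semantic knowledge is available.
            recoded.append(RecodedWord(token, "unknown", []))
        else:
            recoded.append(
                RecodedWord(token, entry.part_of_speech, list(entry.senses))
            )
    return recoded
```

In this sketch, `recode("The bank approved the loan.")` attaches two candidate senses to "bank", so the ambiguity is carried into the recoded text rather than resolved.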
Most problems in natural language processing, e.g., information retrieval, database generation, and machine translation, hinge on relating words to other words that are similar in meaning. Because of the extreme difficulty of producing any accurate deep-level analysis of text, many current strategies are inherently word-based. In the case of information retrieval, current methods match words in a query with words in documents, with the degree of match weighted according to the frequency of words in texts. In database generation, programs map individual words into names of frames or database records. In language translation, systems use mappings between words in a "source" language and words in a "target" language to guide lexical choice (word choice). In all these applications, current methods are limited in their accuracy by the fact that many words have multiple senses and that different words often have similar meanings.
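The limitation of word-based matching can be made concrete with a small sketch. The scoring rule below (query-word occurrences divided by document length) is an assumed, simplified stand-in for frequency weighting, not a formula taken from any particular retrieval system, and the function names and sample documents are hypothetical.

```python
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    tokens = (raw.strip('.,;:!?"').lower() for raw in text.split())
    return [t for t in tokens if t]


def match_score(query: str, document: str) -> float:
    """Count occurrences of query words in the document, normalized by
    document length (an assumed, simplified frequency weighting)."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return 0.0
    counts = Counter(doc_tokens)
    hits = sum(counts[word] for word in set(tokenize(query)))
    return hits / len(doc_tokens)


# Hypothetical documents for a requestor interested in financial institutions.
finance = "The bank raised its interest rate"
river = "We fished from the bank of the river"
lender = "The lender raised its interest rate"
```

Under these assumptions, the query "bank" gives `river` a nonzero score even though it uses a different sense of the word, while `lender` scores zero even though it is similar in meaning to `finance`. Both errors follow from matching on word forms alone.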
This problem is generally referred to as lexical inadequacy. Problems related to lexical inadequacy include the issue of genuinely ambiguous words as well as vague terms and derivative words, i.e., words that have a common root but vary slightly in meaning. Previous approaches to the problem of lexical inadequacy fall into two basic categories: word-based approaches and deep-level approaches. Word-based approaches have addressed the problem in several ways, including using co-occurrence and other contextual information as an indicator of text content to try to filter out inaccuracies, using word roots rather than words by stripping affixes, and using a thesaurus or synonym list that matches words to other words. Deep-level approaches can be more accurate than word-based approaches, but have not been sufficiently robust to perform any practical text processing task. This lack of robustness is generally due to the difficulty of building knowledge bases that are sufficient for broad-scale processing.
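Two of the word-based techniques mentioned above, affix stripping and synonym lists, can be sketched as follows. The suffix list and the synonym table are hypothetical and deliberately tiny; practical stemmers and thesauri are far larger, and this sketch is not a reconstruction of any specific prior system.

```python
# Hypothetical, deliberately small suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical synonym list mapping word roots to a single canonical word.
SYNONYMS = {
    "bank": "bank",
    "lender": "bank",  # financial sense
    "shore": "bank",   # river sense
}


def strip_affixes(word: str) -> str:
    """Remove the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word: str) -> str:
    """Reduce a word to its root, then map the root through the synonym list."""
    root = strip_affixes(word.lower())
    return SYNONYMS.get(root, root)
```

With these tables, `normalize("lenders")` and `normalize("shore")` both yield `"bank"`. The synonym list broadens matching, but it also merges two unrelated senses, which illustrates why word-based normalization alone does not resolve lexical inadequacy.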