The following relates to the linguistic arts. It finds particular application in conjunction with automated natural language processing for use in diverse applications such as electronic language translators, grammar checkers for word processors, document content analyzers, and so forth, and will be described with particular reference thereto. However, it is to be appreciated that the following is also amenable to other like applications.
Natural language processing is typically performed in three distinct processing layers: a lexical processing layer, a syntactical processing layer, and a semantic processing layer. At the lexical stage, the linguistic input is broken into base constituent parts, typically including words and punctuation. Each word, punctuation mark, or other element is typically referred to as a token. At the lexical layer, an attempt is made to associate each word or token with lexical information contained in a lexicon. The lexicon includes morpho-syntactic information, semantic information, and associated parts of speech. Such token association at the lexical stage is referred to as morphological analysis. The lexical layer generally operates on tokens individually, without taking into account the surrounding context, that is, the surrounding tokens. Accordingly, there is often substantial ambiguity remaining after the lexical processing. For example, the token “fly” in the English language could represent a noun indicative of an insect, or it could represent a verb indicative of aerial movement. Moreover, it could be part of a collocation such as “fly wheel” indicative of a mechanical device, or “fly by” indicative of an event involving an aircraft flying overhead.
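By way of illustration only, the lexical stage described above can be sketched as follows. The lexicon contents, the tag names (ART, N, V, PUNCT), and the function names are hypothetical and chosen only for this example; an actual lexicon would also carry morpho-syntactic and semantic information.

```python
import re

# Hypothetical toy lexicon: each token maps to every reading it may have,
# because the lexical layer does not look at neighboring tokens.
LEXICON = {
    "the": ["ART"],
    "fly": ["N", "V"],
    "wheel": ["N", "V"],
    "turns": ["V", "N"],
}


def tokenize(text):
    """Split the input into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lexical_analysis(text):
    """Pair each token with all of its lexicon readings, ignoring context."""
    analyses = []
    for token in tokenize(text):
        if re.fullmatch(r"\w+", token):
            # A word: unknown words simply get an empty list of readings.
            readings = LEXICON.get(token.lower(), [])
        else:
            readings = ["PUNCT"]
        analyses.append((token, readings))
    return analyses


result = lexical_analysis("The fly wheel turns.")
for token, readings in result:
    print(token, readings)
```

In this sketch the token “fly” keeps both its noun and verb readings, since nothing at this stage consults the following token “wheel”.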
At the syntactical layer, the tokens are processed with consideration given to contextual information. Thus, for example, collocations are identified by recognizing the paired tokens (such as “fly” followed by “wheel”), and this additional contextual information is employed to narrow the morpho-syntactic analysis and part of speech of each word. The syntactical processing is sometimes broken down into a disambiguation level that takes into account the word definitions, and a context-free grammar level that takes into account syntactical categories (such as looking at sequences of parts of speech or higher level constituents) without otherwise considering word meaning. Such a grammar is sometimes referred to as an augmented context-free grammar. The grammar is usually described by rewriting rules. Each rewriting rule associates a higher level constituent with an ordered sequence of lower level constituents.
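As a minimal sketch of the collocation recognition mentioned above, adjacent token pairs can be looked up in a table of known collocations and merged into a single unit. The table contents, the choice of a noun reading for each pair, and the function name are assumptions made for this example.

```python
# Hypothetical collocation table: ordered token pairs mapped to the single
# reading the pair is assumed to take as a unit.
COLLOCATIONS = {
    ("fly", "wheel"): "N",
    ("fly", "by"): "N",
}


def merge_collocations(tokens):
    """Replace each known adjacent pair with one multi-word unit."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in COLLOCATIONS:
            merged.append((" ".join(tokens[i:i + 2]), COLLOCATIONS[pair]))
            i += 2
        else:
            # None marks a token whose reading is still unresolved here.
            merged.append((tokens[i], None))
            i += 1
    return merged


units = merge_collocations(["the", "fly", "wheel", "turns"])
print(units)
```

Once “fly” and “wheel” are recognized as a pair, the insect and aerial-movement readings of “fly” no longer need to be considered for that occurrence.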
The rewriting rules can generally be employed in a “top-down” analysis or a “bottom-up” analysis, or in some combination thereof. In a top-down approach, the overall form of the ordered sequence of tokens making up the linguistic input is analyzed to break the sequence down into successively lower level constituents. For example, starting with a sentence (S), a rewriting rule S→NP VP is used to identify a noun part (NP) and a verb part (VP) based on the overall form of the sentence. The NP and VP are high level constituents that are in turn broken down into lower level constituents such as parts of speech.
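A top-down analysis of this kind can be sketched, under simplifying assumptions, as a recursive expansion of rewriting rules over a sequence of part-of-speech tags. Only the rule S→NP VP is taken from the text above; the expansions shown for NP and VP, the tag names, and the function names are illustrative assumptions.

```python
# Each higher level constituent maps to its alternative ordered sequences
# of lower level constituents. The NP and VP expansions are assumptions.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["ART", "N"]],
    "VP": [["V", "NP"], ["V"]],
}


def expand(symbol, tags, pos):
    """Try to derive `symbol` from tags[pos:]; yield (tree, next_pos)."""
    if symbol not in RULES:
        # A part of speech: it must match the next tag in the input.
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in RULES[symbol]:
        yield from expand_sequence(symbol, rhs, tags, pos, [])


def expand_sequence(symbol, rhs, tags, pos, children):
    """Derive the remaining right-hand side symbols one after another."""
    if not rhs:
        yield (symbol, children), pos
        return
    for child, next_pos in expand(rhs[0], tags, pos):
        yield from expand_sequence(
            symbol, rhs[1:], tags, next_pos, children + [child]
        )


def parse_top_down(tags):
    """Return the first tree for S that covers every tag, or None."""
    for tree, end in expand("S", tags, 0):
        if end == len(tags):
            return tree
    return None


tree = parse_top_down(["ART", "N", "V", "ART", "N"])
print(tree)
```

The sketch starts from S and works downward, so a part of speech is examined only once some higher level constituent predicts it. These toy rules contain no left recursion; rules of the form NP→NP … would require a different strategy.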
In a bottom-up approach, individual tokens are grouped to identify successively higher level constituents. For example, the token “the” tagged as an article (ART) followed by the token “dog” tagged as a noun (N) is grouped using a rewriting rule NP→ART N to identify “the dog” as a noun part (NP) constituent. The noun part may then in turn be grouped with a verb part (VP) according to rewriting rule S→NP VP to identify a sentence (S) constituent.
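The bottom-up grouping described above can be sketched as repeated reduction of a tag sequence. The rules NP→ART N and S→NP VP are taken from the text; the rule VP→V, the tag names, and the function names are assumptions added so that the example runs to completion. The sketch applies rules greedily and does not backtrack, which suffices for this toy grammar but not in general.

```python
# Rewriting rules as (higher level constituent, ordered lower level sequence).
# VP -> V is an assumption made for this example.
RULES = [
    ("NP", ("ART", "N")),
    ("VP", ("V",)),
    ("S", ("NP", "VP")),
]


def reduce_once(symbols):
    """Apply the first rule whose right-hand side occurs in the sequence."""
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(symbols) - n + 1):
            if tuple(symbols[i:i + n]) == rhs:
                return symbols[:i] + [lhs] + symbols[i + n:]
    return None


def parse_bottom_up(tags):
    """Return every intermediate sequence, from the tags up to the result."""
    steps = [list(tags)]
    while True:
        reduced = reduce_once(steps[-1])
        if reduced is None:
            return steps
        steps.append(reduced)


# "the dog barks" tagged as ART N V
steps = parse_bottom_up(["ART", "N", "V"])
for step in steps:
    print(step)
```

Each printed line is one grouping step: the article and noun are first grouped into NP, the verb into VP, and the two are finally grouped into S.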
Some syntactical processors employ recursive analysis. Consider the sentence “I have answered the inquiry.”, which contains the past participle “answered”. The lexical analysis identifies the token “have” and the token “answered”. Because the lexical analysis does not consider context, the token “have” is ambiguous, as it could be, for example, a verb or an auxiliary verb. The token “answered” is also ambiguous, and may be either an adjective or a past participle. It is assigned an appropriately ambiguous category such as “ADJORPAP”. At a first pass through the syntactical level, the ordered combination of “have” followed by a token of category “ADJORPAP” is recognized as a past participle form, and so “have” is categorized as an auxiliary verb and “answered” is categorized as a past participle. On a second pass through the syntactical level, a context-free rewriting rule recognizes the ordered combination of the auxiliary verb “have” followed by a past participle as a present perfect tense verbal constituent. Such recursive syntactical processing reduces the computational efficiency and speed of the syntactical layer.
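The two passes described above can be sketched as two separate traversals of the tagged tokens. The category “ADJORPAP” is the one named in the text; the labels VERBORAUX, AUX, PAP, PRESENT_PERFECT, PRON, ART, and N, as well as the function names, are hypothetical and used only for this illustration.

```python
def first_pass(tagged):
    """Disambiguate: 'have' + ADJORPAP -> auxiliary verb + past participle."""
    out = list(tagged)
    for i in range(len(out) - 1):
        (w1, c1), (w2, c2) = out[i], out[i + 1]
        if w1.lower() == "have" and c1 == "VERBORAUX" and c2 == "ADJORPAP":
            out[i], out[i + 1] = (w1, "AUX"), (w2, "PAP")
    return out


def second_pass(tagged):
    """Group: AUX followed by PAP -> one present perfect constituent."""
    out, i = [], 0
    while i < len(tagged):
        is_pair = (
            i + 1 < len(tagged)
            and tagged[i][1] == "AUX"
            and tagged[i + 1][1] == "PAP"
        )
        if is_pair:
            phrase = tagged[i][0] + " " + tagged[i + 1][0]
            out.append((phrase, "PRESENT_PERFECT"))
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return out


# Output of a context-blind lexical layer (ambiguous categories retained).
lexical = [("I", "PRON"), ("have", "VERBORAUX"), ("answered", "ADJORPAP"),
           ("the", "ART"), ("inquiry", "N")]
pass1 = first_pass(lexical)
pass2 = second_pass(pass1)
print(pass1)
print(pass2)
```

The second traversal can only succeed after the first has rewritten the categories, which illustrates why the same input is scanned twice in such an arrangement.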
Another problem arises with the use of proper names. For example, consider the proper name “Bankunited Bancorp”. It would be desirable to recognize this as the proper name of a bank; however, at the lexical level the tokens “Bankunited” and “Bancorp” are unlikely to be included in the lexicon unless the named bank is a large national or international bank. If the lexicon does not contain these tokens, then the lexical level will be unable to assign morpho-syntactic information, semantic information, or parts of speech to the tokens “Bankunited” and “Bancorp”. The subsequently performed syntactical level will also be unable to assign meaning to these tokens, except that possibly their status as noun parts of speech may be guessed based on the surrounding context. Similar problems arise in other higher level constituent classes whose members are not readily exhaustively cataloged in the lexicon, such as chemical names, personal names, and so forth.
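The difficulty with tokens absent from the lexicon can be illustrated with a minimal sketch. The lexicon contents, the tag names, and the guessing heuristic (an unknown token following a preposition, or following another guessed noun, is guessed to be a noun) are assumptions made for this example and are not taken from any particular system.

```python
# Hypothetical toy lexicon that does not contain the bank's name.
LEXICON = {
    "shares": ["N", "V"],
    "of": ["PREP"],
    "rose": ["V", "N"],
}


def analyze(tokens):
    """Return (token, readings, source) triples for each token."""
    results = []
    for token in tokens:
        readings = LEXICON.get(token.lower())
        if readings is not None:
            results.append((token, readings, "lexicon"))
            continue
        # Unknown token: the only available evidence is the left context.
        previous = results[-1] if results else None
        after_prep = previous is not None and previous[1] == ["PREP"]
        after_guess = previous is not None and previous[2] == "guess"
        if after_prep or after_guess:
            results.append((token, ["N"], "guess"))
        else:
            results.append((token, [], "unknown"))
    return results


analysis = analyze("Shares of Bankunited Bancorp rose".split())
for entry in analysis:
    print(entry)
```

Even where the guess succeeds, the result is only a pair of tokens guessed to be nouns; nothing in it identifies “Bankunited Bancorp” as the proper name of a bank.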
The following copending, commonly assigned applications: Bilingual Authorizing Assistant for the “Tip of the Tongue” Problem (Xerox ID 20040609-US-NP, Ser. No. 11/018,758 filed Dec. 21, 2004); and Retrieval Method For Translation Memories Containing Highly Structured Documents (Xerox ID 20031674-US-NP, Ser. No. 11/018,891 filed Dec. 21, 2004) are herein incorporated by reference.