The present invention relates to natural language processing. In particular, the present invention relates to syntactic parsing of text.
A natural language parser is a program that takes a text segment, usually a sentence, of natural language (i.e., human language, such as English) and produces a data structure, usually referred to as a parse tree. This parse tree typically represents the syntactic relationships between the words in the input segment.
In general, natural language parsers build the parse trees by applying syntax rules to the input text segment. Parsers apply these rules in either a “top-down” or a “bottom-up” manner.
In a bottom-up parser, all of the possible parts of speech for the individual words of the input text are first identified to form a set of word tokens. The parser then attempts to combine the individual word tokens into larger syntactic structures, such as noun phrases and verb phrases, by applying syntax rules to the tokens. The resulting larger structures represent candidate nodes for the parse tree. The parser continues to try to build larger and larger structures by applying syntactic rules to previously identified candidate nodes. A full parse is achieved when a node spans the entire text segment.
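The bottom-up procedure described above can be illustrated with a minimal chart-based sketch. The lexicon, the rule table, and the function name `bottom_up_parse` are hypothetical and chosen only for illustration; an actual parser would use a far larger grammar and would typically record tree structure rather than just node labels.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to its possible parts of speech.
LEXICON = {
    "Joe": {"NP"},
    "Mary": {"NP"},
    "likes": {"V"},
}

# Hypothetical binary syntax rules: (left label, right label) -> parent label.
RULES = {
    ("V", "NP"): "VP",
    ("NP", "VP"): "S",
}

def bottom_up_parse(words):
    """Return a chart mapping (start, end) spans to sets of candidate node labels."""
    n = len(words)
    chart = {}
    # Step 1: form word tokens from every possible part of speech of each word.
    for i, word in enumerate(words):
        chart[(i, i + 1)] = set(LEXICON.get(word, ()))
    # Step 2: combine adjacent candidate nodes into larger and larger structures.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            labels = set()
            for mid in range(start + 1, end):
                for left, right in product(chart[(start, mid)], chart[(mid, end)]):
                    parent = RULES.get((left, right))
                    if parent:
                        labels.add(parent)
            chart[(start, end)] = labels
    return chart

chart = bottom_up_parse(["Joe", "likes", "Mary"])
# A full parse exists when some node spans the entire text segment.
full_parse_found = "S" in chart[(0, 3)]
```

In this sketch, the span covering "likes Mary" receives the candidate node VP, while the span covering "Joe likes" receives no candidate node because no rule in the toy grammar combines NP with V.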
Many syntax rules encode grammatical relations between the nodes that they combine into a larger segment. For instance, in the sentence “Joe likes Mary”, the syntax rule that combines the verb phrase “likes” with the noun phrase “Mary” to form a larger verb phrase will also identify the noun phrase as the direct object of the verb phrase. However, relationships between nodes that do not appear near each other are typically not identified in the parse. For instance, in the question “Who does Joe like?”, the word “who” is interpreted as the direct object of the verb “like”. However, because “who” is separated from “like” by the words “does Joe”, most syntactic parsers would not identify the non-local relationship between “who” and “like”.
In extreme cases, there is no limit on the distance between the words involved in a non-local relationship. Such relationships are known as unbounded dependencies.
Although a valid syntactic parse can be formed without identifying these non-local relationships, the relationships must be identified when constructing a representation of the argument structure, or logical form, for the text. As is well known in the art, a logical form is a more generalized version of the syntactic parse that shows the basic argument structure of the text without being affected by how the components of the argument structure are expressed in the text. Thus, the sentences “I bought the book” and “The book was bought by me” would have the same logical form but different syntactic parses.
In most systems, logical forms are constructed by identifying relationships within clauses of the text, and between any clause and the clauses that are subordinate to it. Thus, relationships that hold between one clause and a superordinate clause, or between an element in one clause and an element in some superordinate clause, present an exception to normal logical form processing. To deal with these non-local relationships, the logical form systems must implement special rules that search for relationships that extend beyond clause boundaries.
To overcome this problem, some prior art syntactic parsers introduce empty elements after words in sentences that can be in non-local relationships and that are missing a neighboring word needed to complete a relationship locally. Each empty element is indexed and this index is passed upward in the parse tree as larger structures are built with the empty element. Eventually, the index is used by a rule to link the empty element with the word or phrase that the empty element represents in the parse.
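The index-passing scheme described above can be sketched as follows for the question “Who does Joe like?”. This is a simplified illustration and not the implementation of any particular prior art parser; the `Node` class, the `combine` and `link_filler` functions, and the node labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str
    text: str
    # Index of an unresolved empty element somewhere below this node, if any.
    gap_index: Optional[int] = None
    children: tuple = ()

def combine(label, left, right):
    """Build a larger structure and pass any empty-element index upward."""
    gap = left.gap_index if left.gap_index is not None else right.gap_index
    text = f"{left.text} {right.text}".strip()
    return Node(label, text, gap, (left, right))

def link_filler(filler, clause):
    """Link the empty element carried by `clause` to the phrase it represents."""
    if clause.gap_index is None:
        raise ValueError("clause has no unresolved empty element")
    binding = {clause.gap_index: filler.text}
    resolved = Node("S", f"{filler.text} {clause.text}", None, (filler, clause))
    return resolved, binding

# An indexed empty element is introduced after "like", which lacks a local object.
empty = Node("NP", "", gap_index=1)
vp = combine("VP", Node("V", "like"), empty)      # index 1 passed up to the VP
clause = combine("S", Node("NP", "Joe"), vp)      # index 1 passed up to the clause
clause = combine("S", Node("AUX", "does"), clause)
resolved, binding = link_filler(Node("NP", "Who"), clause)
```

Here `binding` records that empty element 1, the missing direct object of “like”, is represented by “Who”, and the resolved node no longer carries an open index.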
One problem with using empty elements is that doing so greatly increases the number of hypotheses that must be considered during syntactic parsing, because each empty element represents an additional word that must be parsed.
Thus, a syntactic parser is needed that can identify non-local relationships in an input sentence without increasing the number of hypotheses generated during a parse.