The present invention relates to natural language processing. In particular, the present invention relates to syntactic parsing of text.
A natural language parser is a program that takes a segment of natural language text (i.e., human language, such as English), usually a sentence, and produces a representation of the syntactic structures in that segment. One common representation is a parse tree, which represents the syntactic structure hierarchically, with leaf nodes that represent the individual words of the text segment and a root node that spans the entire text segment.
In general, natural language parsers build the parse trees by applying syntax rules to the input text segment. Parsers apply these rules in either a “top-down” or a “bottom-up” manner.
In a bottom-up parser, all of the possible parts of speech for the individual words of the input text are first identified to form a set of word tokens. The parser then attempts to combine the individual word tokens into larger syntactic structures, such as noun phrases and verb phrases, by applying syntax rules to the tokens. The resulting larger structures represent candidate nodes for the parse tree. The parser continues to try to build larger and larger structures by applying syntactic rules to previously identified candidate nodes. A full parse is achieved when a node spans the entire text segment.
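The bottom-up procedure described above can be illustrated with a minimal sketch. The lexicon, the two binary rules, and the names `Node`, `LEXICON`, `RULES`, and `bottom_up_parse` are hypothetical and chosen only for illustration; they are not taken from any particular parser, and a practical system would use a far larger rule set.

```python
from collections import namedtuple

# A candidate node: a label plus the positions of the first and last
# tokens it spans, and the child nodes it was built from.
Node = namedtuple("Node", "label first last children")

# Hypothetical toy lexicon: each word maps to all of its possible parts of speech.
LEXICON = {
    "the": ["DET"],
    "dog": ["NOUN"],
    "barks": ["VERB", "NOUN"],
}

# Hypothetical binary syntax rules: (left label, right label) -> new label.
RULES = {
    ("DET", "NOUN"): "NP",
    ("NP", "VERB"): "S",
}


def bottom_up_parse(words):
    """Return every candidate node that spans the whole input."""
    # Step 1: form one word token for each possible part of speech.
    tokens = [
        Node(pos, i, i, ())
        for i, word in enumerate(words)
        for pos in LEXICON[word]
    ]
    seen = set(tokens)
    agenda = list(tokens)

    # Step 2: repeatedly combine adjacent candidate nodes into larger ones.
    while agenda:
        node = agenda.pop()
        for other in list(seen):
            for left, right in ((node, other), (other, node)):
                if left.last + 1 != right.first:
                    continue  # the two nodes are not adjacent
                label = RULES.get((left.label, right.label))
                if label is None:
                    continue  # no syntax rule combines this pair
                candidate = Node(label, left.first, right.last, (left, right))
                if candidate not in seen:
                    seen.add(candidate)
                    agenda.append(candidate)

    # Step 3: a full parse is any node that spans the entire text segment.
    return [n for n in seen if n.first == 0 and n.last == len(words) - 1]
```

In this sketch every combinable pair of adjacent nodes yields a candidate node, so the number of candidates grows quickly with sentence length and with lexical ambiguity (here, "barks" contributes both a verb token and a noun token).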
During the construction of the nodes, attribute-value pairs that describe the structure represented by each node are created. For example, a first token attribute and a last token attribute are associated with each node to indicate the positions in the input string of the first and last tokens that the node spans. Additionally, each node has a “head” attribute that designates the primary element of the phrase represented by that node, a “prmods” attribute that designates the (potentially empty) list of modifiers found before the head in the phrase, and a “psmods” attribute that designates the (potentially empty) list of modifiers found after the head in the phrase. The number and types of attributes associated with a node are unlimited and are controlled by the rule used to form the node.
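As a rough illustration of these attribute-value pairs, the following sketch models a node as a small record. The class name `ParseNode`, the `extra` dictionary for rule-specific attributes, and the helper functions `make_word` and `apply_np_rule` are assumptions made for this example only; the attribute names `head`, `prmods`, and `psmods` follow the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    label: str
    first_token: int   # position of the first token the node spans
    last_token: int    # position of the last token the node spans
    head: "ParseNode | None" = None             # primary element of the phrase
    prmods: list = field(default_factory=list)  # modifiers before the head
    psmods: list = field(default_factory=list)  # modifiers after the head
    extra: dict = field(default_factory=dict)   # any further rule-defined attributes


def make_word(label, position):
    """Create a leaf node for a single word token (hypothetical helper)."""
    return ParseNode(label, position, position)


def apply_np_rule(premodifiers, noun, postmodifiers=()):
    """Hypothetical rule: build a noun phrase node around a head noun."""
    parts = [*premodifiers, noun, *postmodifiers]
    return ParseNode(
        label="NP",
        first_token=min(p.first_token for p in parts),
        last_token=max(p.last_token for p in parts),
        head=noun,
        prmods=list(premodifiers),
        psmods=list(postmodifiers),
    )


# Usage: "the big dog" as determiner, adjective, and head noun.
det, adj, noun = make_word("DET", 0), make_word("ADJ", 1), make_word("NOUN", 2)
np_node = apply_np_rule([det, adj], noun)
```

Here the rule that forms the node decides which attributes are set, which mirrors the statement that the attributes of a node are controlled by the rule used to form it.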
The computational complexity of forming the parse is a function of the number of candidate nodes that are formed. To limit the number of candidate nodes, some systems adopt a minimal attachment strategy that prevents certain candidate nodes from being formed if other candidate nodes have already been formed or are expected to be formed.
Although this minimal attachment strategy reduces the complexity of forming an initial parse structure, it can result in parse trees that are less than optimal. To address this, many parsing systems re-examine the initial parse trees to determine if each tree can be changed to provide a better parse.
The goal of such systems is to provide a single improved parse tree for each initial parse tree. Thus, even though an initial parse tree could be modified in several different ways, parsing systems of the prior art have been limited to providing only one modified parse tree for each initial parse tree.
Such systems are undesirable because the syntactic rules used to identify an improved parse have only a limited ability to resolve syntactic ambiguities in the initial parse. As a result, the syntactic parser may fail to produce the best parse for the sentence. Thus, a system is needed that provides better syntactic parses of sentences.