The present invention relates to natural language processing. In particular, the present invention relates to parsing natural language text.
A natural language parser is a program that takes a text segment, usually a sentence, of natural language (i.e., human language, such as English) and produces a data structure, usually referred to as a parse tree. This parse tree typically represents the syntactic relationships between the words in the input segment.
In general, natural language parsers build the parse trees by applying syntax rules to the input text segment. Parsers apply these rules in either a “top-down” or a “bottom-up” manner.
In a bottom-up parser, all of the possible parts of speech for the individual words of the input text are first identified to form a set of word tokens. The parser then attempts to combine the individual word tokens into larger syntactic structures, such as noun phrases and verb phrases, by applying syntax rules to the tokens. The resulting larger structures represent candidate nodes for the parse tree. The parser continues to try to build larger and larger structures by applying syntactic rules to previously identified candidate nodes. A full parse is achieved when a node spans the entire text segment.
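The bottom-up procedure described above can be illustrated with a minimal sketch. The lexicon, the rule set, and the names `LEXICON`, `RULES`, and `bottom_up_parse` are hypothetical and chosen only for illustration; the sketch assumes binary rules and an exhaustive chart, and is not drawn from any particular prior art parser.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each word maps to its possible parts of speech.
LEXICON = {
    "the": {"Det"},
    "dog": {"Noun"},
    "saw": {"Noun", "Verb"},
    "cat": {"Noun"},
}

# Hypothetical binary syntax rules: (left child, right child) -> parent.
RULES = {
    ("Det", "Noun"): "NP",
    ("Verb", "NP"): "VP",
    ("NP", "VP"): "S",
}


def bottom_up_parse(words):
    """Return a chart mapping (start, end) spans to the node labels found there."""
    n = len(words)
    chart = defaultdict(set)
    # Step 1: form one word token per possible part of speech.
    for i, word in enumerate(words):
        chart[(i, i + 1)] |= LEXICON[word]
    # Step 2: combine adjacent nodes into larger and larger candidate nodes.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        parent = RULES.get((left, right))
                        if parent is not None:
                            chart[(start, end)].add(parent)
    return chart


words = "the dog saw the cat".split()
chart = bottom_up_parse(words)
# A full parse exists when a node spans the entire text segment.
full_parse_found = "S" in chart[(0, len(words))]
```

In this sketch the word "saw" yields two word tokens, a noun and a verb, and only the verb token ends up contributing to the node that spans the whole segment.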
The performance of a parser is measured by its speed and its accuracy. Very accurate parsers can be formed by adopting exhaustive search strategies that build all of the possible full parse trees before identifying a “best” parse tree.
Although exhaustive-search parsers are accurate, they are also slow. To make the parse faster, the prior art has developed various techniques that prioritize the order in which nodes are formed during parse tree construction. The goal of these techniques is to form the correct parse while generating a minimum number of intermediate candidate nodes. Ideally, all of the candidate nodes that are formed would eventually be found in the final parse tree.
One prioritizing technique involves ordering the rules that are applied to the nodes and tokens so that rules with high probabilities of forming part of the final parse are applied before rules with lower probabilities. Other techniques place the tokens and candidate nodes in a list ordered by some metric or “goodness measure” that indicates the likelihood that the node or token will appear in the final parse. The nodes and tokens that are higher in the list are used to form larger nodes before those that are lower in the list.
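Ordering candidate nodes by a goodness measure can be sketched with a priority queue. The token scores, the rule probabilities, and the scoring formula (rule probability multiplied by the scores of the two children) are assumptions made for illustration, as are the names `TOKENS`, `RULES`, and `best_first_parse`.

```python
import heapq

# Hypothetical rules with made-up probabilities: (left, right) -> (parent, prob).
RULES = {
    ("Det", "Noun"): ("NP", 0.9),
    ("Verb", "NP"): ("VP", 0.8),
    ("NP", "VP"): ("S", 0.9),
}

# Hypothetical word tokens for "the dog saw the cat", one list per word
# position, each entry a (part of speech, made-up starting score) pair.
TOKENS = [
    [("Det", 1.0)],
    [("Noun", 1.0)],
    [("Verb", 0.8), ("Noun", 0.2)],
    [("Det", 1.0)],
    [("Noun", 1.0)],
]


def best_first_parse(tokens, goal="S"):
    """Expand the highest-scoring node first; stop at the first full parse."""
    n = len(tokens)
    agenda = []  # min-heap of (-score, label, start, end)
    for i, options in enumerate(tokens):
        for label, score in options:
            heapq.heappush(agenda, (-score, label, i, i + 1))
    chart = {}  # nodes already taken off the agenda: (label, start, end) -> score
    while agenda:
        neg_score, label, start, end = heapq.heappop(agenda)
        node = (label, start, end)
        if node in chart:
            continue
        score = -neg_score
        chart[node] = score
        if label == goal and start == 0 and end == n:
            return node, chart
        # Try to combine the new node with every adjacent node in the chart.
        for (o_label, o_start, o_end), o_score in list(chart.items()):
            if o_end == start and (o_label, label) in RULES:
                parent, prob = RULES[(o_label, label)]
                heapq.heappush(agenda, (-(prob * o_score * score), parent, o_start, end))
            if o_start == end and (label, o_label) in RULES:
                parent, prob = RULES[(label, o_label)]
                heapq.heappush(agenda, (-(prob * score * o_score), parent, start, o_end))
    return None, chart


full_parse, chart = best_first_parse(TOKENS)
```

With these made-up scores, the low-scoring noun token for "saw" is still on the list when the full parse is found, so no attempt is ever made to combine it with its neighbors.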
Examples of metrics used to order the nodes include heuristic scoring techniques. These techniques assign starting scores to each of the tokens and provide some formula for generating scores for the larger nodes based in part on the scores of the tokens and intermediate nodes below the larger nodes. Other metrics include simple statistical metrics that count how frequently a node of a particular type, such as a verb phrase or noun phrase, appears in parse trees formed from a training corpus.
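A simple statistical metric of the kind described above can be sketched by counting node types in a small set of parse trees. The two trees, the tuple representation, and the use of relative frequency as the score are assumptions made for illustration; `TRAINING_TREES` and `count_node_types` are hypothetical names.

```python
from collections import Counter

# Hypothetical training corpus. Each node is a tuple (label, child, ...),
# and each leaf word is a plain string.
TRAINING_TREES = [
    ("S",
     ("NP", ("Det", "the"), ("Noun", "dog")),
     ("VP", ("Verb", "barked"))),
    ("S",
     ("NP", ("Det", "the"), ("Noun", "cat")),
     ("VP", ("Verb", "saw"),
      ("NP", ("Det", "a"), ("Noun", "bird")))),
]


def count_node_types(tree, counts):
    """Recursively count how often each node type occurs in a parse tree."""
    label, *children = tree
    counts[label] += 1
    for child in children:
        if isinstance(child, tuple):  # skip leaf words
            count_node_types(child, counts)


counts = Counter()
for tree in TRAINING_TREES:
    count_node_types(tree, counts)

# Assumed scoring: relative frequency of each node type among all nodes.
total = sum(counts.values())
node_type_score = {label: count / total for label, count in counts.items()}
```

Under this assumed scoring, every noun phrase receives the same score regardless of the words it spans.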
One prior art technique uses a more advanced statistical metric that considers more than just the node type when determining the probability of the node. In this technique, the headword of the node (the word that carries the focus of the segment spanned by the node), the phrase level of the node (the relative complexity of the phrase), and the syntactic history (such as whether or not the node is passive) are used to further divide the probability space so that the probability associated with a node better describes its actual probability of occurring in a parse tree.
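The effect of dividing the probability space by additional features can be sketched as follows. The observations are made-up toy data, the integer encoding of phrase level is an assumption, and `OBSERVATIONS` and `estimate` are hypothetical names. The sketch does not reproduce the formula of any particular prior art technique; it only contrasts an estimate keyed on node type alone with one keyed on node type, headword, phrase level, and syntactic history.

```python
from collections import defaultdict

# Hypothetical observations of candidate nodes formed during training parses:
# (node_type, headword, phrase_level, is_passive, appeared_in_final_parse)
OBSERVATIONS = [
    ("VP", "seen", 1, True, True),
    ("VP", "seen", 1, True, True),
    ("VP", "seen", 1, False, False),
    ("VP", "saw", 1, False, True),
    ("VP", "saw", 1, True, False),
    ("NP", "saw", 0, False, False),
]


def estimate(observations, key_fn):
    """Estimate P(node appears in final parse) for each key produced by key_fn."""
    formed = defaultdict(int)
    kept = defaultdict(int)
    for obs in observations:
        key = key_fn(obs)
        formed[key] += 1
        if obs[-1]:
            kept[key] += 1
    return {key: kept[key] / formed[key] for key in formed}


# Coarse estimate: node type only.
by_type = estimate(OBSERVATIONS, lambda obs: obs[0])
# Finer estimate: node type, headword, phrase level, and syntactic history.
by_features = estimate(OBSERVATIONS, lambda obs: obs[:4])
```

In this toy data the coarse estimate assigns one probability to every verb phrase, while the finer estimate separates the passive verb phrases headed by "seen" from the non-passive ones.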
Although these techniques for guiding the search for a parse tree have reduced the time needed to form the parse tree, there is an ongoing need to further reduce the parse time.