A natural language parser is a program that takes a segment, usually a sentence, of natural language (i.e., human language, such as English) text as input and produces as output a data structure for that segment, usually referred to as a parse tree. This parse tree typically represents the syntactic relationships between the words in the input segment. Natural language parsers have traditionally been "rule-based." Such rule-based parsers store knowledge about the syntactic structure of a language in the form of syntax rules, and apply these rules to the input text segment to obtain the resulting parse tree. The parser usually stores information about individual words, such as which parts of speech each word can represent, in a dictionary or "lexicon," which is accessed by the parser for each word in the input text prior to applying the syntax rules.
Parsers apply rules in either a "top-down" or a "bottom-up" manner. In the following example, bottom-up parsing is described. To generate a parse tree, a bottom-up parser first creates one or more leaf nodes for each word of an input sentence. Each leaf node indicates a possible part-of-speech of the word. For example, the word "part" can be used as either a noun or a verb. The parser then applies the syntax rules to generate intermediate-level nodes linked to one, two, or occasionally more existing nodes. Assuming that the parse is successful, eventually the parser will generate a single root node for a complete syntax parse tree that encompasses an entire sentence (i.e., includes one leaf node for each word of the input sentence).
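The leaf-node creation step can be sketched as follows. This is a minimal illustration only: the lexicon contents, the LeafNode record, and the make_leaf_nodes function are hypothetical names introduced here, and a word missing from the lexicon is simply tagged "unknown" as an assumption.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: each word maps to the parts of speech it can represent.
LEXICON = {
    "the": ["determiner"],
    "part": ["noun", "verb"],
}


@dataclass(frozen=True)
class LeafNode:
    word: str
    position: int        # index of the word in the input sentence
    part_of_speech: str  # one possible part of speech for the word


def make_leaf_nodes(sentence):
    """Create one leaf node for each possible part of speech of each word."""
    leaves = []
    for position, word in enumerate(sentence.lower().split()):
        # Assumption: words not found in the lexicon get a single "unknown" leaf.
        for pos in LEXICON.get(word, ["unknown"]):
            leaves.append(LeafNode(word, position, pos))
    return leaves


leaves = make_leaf_nodes("the part")
for leaf in leaves:
    print(leaf)
```

Because "part" is ambiguous, it contributes two leaf nodes, and later rule applications decide which one ends up in a complete tree.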
A bottom-up parser attempts to apply syntax rules one-at-a-time to single leaf nodes, to pairs of leaf nodes, and, occasionally, to larger groups of leaf nodes. If the syntax rule specifies that two certain types of nodes can be combined into a higher-level node and a pair of adjacent nodes matches that specification, then the parser applies the rule to the adjacent nodes to create a higher-level node representing the syntactic construct of the rule. Each rule comprises a specification and optional conditions. The specification indicates that certain types of syntactic constructs can be combined to form a new syntactic construct (e.g., "verb phrase=noun+verb"), and the conditions, if any, specify criteria that need to be satisfied before the rule can succeed (e.g., number agreement of noun and verb). For example, the words "he see" represent a noun and a verb, respectively, which can be potentially combined into the higher-level syntactic construct of a verb phrase. The specification of "verb phrase=noun+verb" indicates that an intermediate-level verb phrase node linked to the two leaf nodes representing "he" and "see" can be created. However, the syntax rule may have a condition which indicates that the noun and verb need to be in agreement as to number (singular or plural). In this example, since "he" is singular and "see" is a plural verb form, the two words do not agree in number and the syntax rule does not succeed. Syntax rules whose specifications match nodes of sub-trees are rules that can be potentially (assuming the conditions are satisfied) applied to create a higher-level node. As each new node is created, it is linked to already-existing leaf nodes and intermediate-level nodes, and becomes part of the total set of nodes to which the syntax rules are applied. The process of applying syntax rules to the growing set of nodes continues until a complete syntax parse tree is generated.
A complete syntax parse tree includes all of the words of the input as leaf nodes and represents one possible parse of the input.
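The distinction between a rule's specification and its conditions can be illustrated with a short sketch. The Node and Rule classes, the "number" feature, and the number_agrees function are hypothetical names chosen for this example; they show one plausible way to model the "verb phrase=noun+verb" rule and its agreement condition, not a particular parser's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Node:
    category: str                   # e.g. "noun", "verb", "verb phrase"
    text: str
    number: Optional[str] = None    # hypothetical feature: "singular" or "plural"
    children: Tuple["Node", ...] = ()


@dataclass(frozen=True)
class Rule:
    result: str                     # construct created when the rule succeeds
    constituents: Tuple[str, ...]   # specification: categories to be combined
    condition: Optional[Callable[[Tuple[Node, ...]], bool]] = None

    def apply(self, nodes):
        """Return a new higher-level node, or None if the rule does not succeed."""
        # Step 1: the specification must match the categories of the nodes.
        if tuple(n.category for n in nodes) != self.constituents:
            return None
        # Step 2: the optional condition must be satisfied.
        if self.condition is not None and not self.condition(nodes):
            return None
        text = " ".join(n.text for n in nodes)
        return Node(self.result, text, nodes[0].number, tuple(nodes))


def number_agrees(nodes):
    """Condition: the noun and the verb must agree in number."""
    return nodes[0].number == nodes[1].number


VP_RULE = Rule("verb phrase", ("noun", "verb"), number_agrees)

he = Node("noun", "he", "singular")
see = Node("verb", "see", "plural")
sees = Node("verb", "sees", "singular")

failed = VP_RULE.apply((he, see))      # specification matches, condition fails
succeeded = VP_RULE.apply((he, sees))  # specification and condition both hold
print(failed)
print(succeeded.category, "->", [c.text for c in succeeded.children])
```

For "he see" the specification matches but the condition rejects the pair, so no node is created; for "he sees" a verb phrase node linked to both leaf nodes is returned.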
A typical parser uses a chart data structure to track the nodes that have been created. Each node is represented by a record that is stored in the chart. A parser would typically select each syntax rule and determine whether it can be applied to the records currently in the chart. If the rule can be applied, then the parser checks the conditions on each of the constituents of the syntax rule. If the conditions are satisfied, then the rule succeeds and the parser creates a new record and stores it in the chart. Each record, thus, corresponds to a sub-tree that may potentially be part of the complete syntax parse tree. When a record is added to the chart that encompasses all the words of the input sentence, then the tree represented by the record is a complete parse of the input sentence.
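A chart of this kind can be sketched in a few lines. The grammar, lexicon, and Record layout below are hypothetical and deliberately tiny; rule conditions are omitted for brevity, only two-constituent rules are handled, and the loop simply reapplies every rule until no new record can be added.

```python
from collections import namedtuple

# A chart record: a sub-tree of category `cat` covering words[start:end].
Record = namedtuple("Record", "cat start end children")

# Hypothetical toy lexicon and grammar (two-constituent rules only).
LEXICON = {"the": ["Det"], "dog": ["Noun"], "barks": ["Verb"]}
RULES = [
    ("NP", ("Det", "Noun")),
    ("S", ("NP", "Verb")),
]


def parse(words):
    """Return (chart, records that span the whole input)."""
    chart = []
    # Seed the chart with one leaf record per possible part of speech.
    for i, word in enumerate(words):
        for cat in LEXICON.get(word, []):
            chart.append(Record(cat, i, i + 1, (word,)))

    added = True
    while added:  # stop when a full pass adds no new record
        added = False
        for result, (left_cat, right_cat) in RULES:
            for left in list(chart):
                for right in list(chart):
                    adjacent = left.end == right.start
                    if left.cat == left_cat and right.cat == right_cat and adjacent:
                        record = Record(result, left.start, right.end, (left, right))
                        if record not in chart:
                            chart.append(record)
                            added = True

    complete = [r for r in chart if r.start == 0 and r.end == len(words)]
    return chart, complete


chart, complete = parse("the dog barks".split())
for record in chart:
    print(record.cat, record.start, record.end)
```

Each record corresponds to a sub-tree, and a record whose span covers every word of the input represents a complete parse.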
The parser can conduct an exhaustive search for all possible complete syntax parse trees by continuously applying the rules until no additional rules can be applied. The parser can also use various heuristic or statistical approaches to guide the application of syntax rules so that the rules that are most likely to result in a complete syntax parse tree are applied first. Using such approaches, after one or a few complete syntax parse trees are generated, the parser typically can terminate the search because the syntax parse tree most likely to be chosen as best representing the input is probably one of the first generated syntax parse trees. If no complete syntax parse trees are generated after a reasonable search, then a fitted parse can be achieved by combining the most promising sub-trees together into a single tree using a root node that is generated by the application of a special aggregation rule.
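A fitted parse can be approximated with a simple greedy strategy, shown below as a sketch under stated assumptions. The text does not say how the "most promising" sub-trees are chosen; this example assumes that the longest sub-tree starting at each position is the most promising, and the "FITTED" root label and fitted_parse function are hypothetical names.

```python
from collections import namedtuple

# Simplified chart record: category plus the span of words it covers.
Record = namedtuple("Record", "cat start end")


def fitted_parse(chart, n_words):
    """Join sub-trees left to right under a single aggregate root node."""
    pieces = []
    position = 0
    while position < n_words:
        candidates = [r for r in chart if r.start == position]
        if not candidates:
            position += 1  # assumption: skip a word that has no record
            continue
        # Assumption: the sub-tree covering the most words is the most promising.
        best = max(candidates, key=lambda r: r.end)
        pieces.append(best)
        position = best.end
    return ("FITTED", tuple(pieces))


# A chart in which no record spans all four words of the input.
chart = [
    Record("Det", 0, 1),
    Record("Noun", 1, 2),
    Record("NP", 0, 2),
    Record("Prep", 2, 3),
    Record("Noun", 3, 4),
]
root = fitted_parse(chart, 4)
print(root[0], [r.cat for r in root[1]])
```

The resulting root is not produced by an ordinary syntax rule; it only aggregates the partial results so that downstream processing receives a single tree.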
In one parser, the syntax rules are ordered by their probabilities of successful application. The probabilities used are based on syntactic analysis of a number of standard input sentences. The statistical ordering of syntax rules is described in U.S. patent application Ser. No. 08/265,845, entitled "Bootstrapping Statistical Processing," which is hereby incorporated by reference. The parser attempts to apply syntax rules in the order of their probabilities. In general, application of a great many less probable rules is avoided, saving the time of their application.
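Ordering rules by probability can be sketched as follows. The rule names and counts are invented for illustration, and estimating a rule's probability as successes divided by attempts over a set of analyzed sentences is an assumption made here, not a description of the method in the referenced application.

```python
# Hypothetical statistics gathered from analyzing standard input sentences:
# rule name -> (number of attempted applications, number of successful ones).
RULE_STATS = {
    "VP -> Verb NP": (200, 50),
    "NP -> Det Noun": (100, 80),
    "NP -> NP PP": (40, 4),
    "S -> NP VP": (50, 30),
}


def success_probability(attempts, successes):
    """Estimate the probability that applying a rule succeeds."""
    return successes / attempts if attempts else 0.0


def order_rules(stats):
    """Return rule names sorted from most to least likely to succeed."""
    return sorted(
        stats,
        key=lambda name: success_probability(*stats[name]),
        reverse=True,
    )


ordered = order_rules(RULE_STATS)
for name in ordered:
    print(name, success_probability(*RULE_STATS[name]))
```

A parser that tries rules in this order and stops once a complete parse is found would rarely reach the rules at the end of the list.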
Although such parsers can theoretically generate all possible syntax parse trees for an input sentence, they have the serious drawback that, despite statistical rule ordering, the complexity of the generated intermediate parse trees grows exponentially with the length of the input sentence being parsed. This exponential growth can quickly exceed memory and response time constraints for a particular application program that uses the parser. When memory or response time constraints have been exceeded, and parsing is stopped, the parser may have failed to produce a parse tree that spans all of the words in the input sentence. In particular, the parser may have failed to parse certain portions of the input. Thus, the resulting parse tree is completely uninformative as to those portions that were not parsed.