A natural language parser is a program that takes a span of natural language (e.g., English) text, usually a sentence, as input and produces a data structure for that span, usually referred to as a parse tree. This parse tree typically represents the syntactic relationships between the words in the input span. In other words, the parse tree typically represents a sentence-like construction of the input span.
Natural language parsers have traditionally been rule-based. Such rule-based parsers store knowledge about the syntactic structure of the natural language in the form of syntax rules, and apply these rules to the input text span to obtain the resulting parse tree. The parser usually stores information about individual words, such as what parts-of-speech they can represent, in a dictionary or "lexicon". The dictionary or lexicon is accessed by the parser for each word in the input span prior to applying the syntax rules.
To generate a parse tree, a conventional parser first creates one or more leaf nodes for each word of the input span. Each leaf node indicates a possible part of speech of the word. For example, the word "part" can be used as either a noun or a verb. A leaf node contains a single word and its associated part of speech. An intermediate-level node contains leaf nodes as its basic elements. Adjacent nodes are leaf nodes or intermediate-level nodes that are adjacent to one another in the input span. The parser applies the syntax rules to generate intermediate-level nodes linked to one, two, or occasionally more existing nodes. Assuming that the parse is successful, eventually the parser will generate a single node for a complete syntax parse tree that encompasses an entire sentence (i.e., includes one leaf node for each word of the input sentence).
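The leaf-node step described above can be illustrated with a minimal sketch. The lexicon contents, the `Node` record, and the function name `make_leaf_nodes` are all hypothetical and chosen only for illustration; an actual parser would store far richer information per word.

```python
from typing import NamedTuple

# Hypothetical toy lexicon: each word maps to the parts of speech it can represent.
LEXICON = {
    "the": ["Det"],
    "part": ["Noun", "Verb"],
    "fits": ["Verb", "Noun"],
}

class Node(NamedTuple):
    label: str   # part of speech of the leaf
    word: str    # the single word the leaf contains
    start: int   # index of the word in the input span
    end: int     # index one past the word

def make_leaf_nodes(words):
    """Create one leaf node per (word, possible part of speech) pair."""
    leaves = []
    for i, word in enumerate(words):
        # Look the word up in the lexicon before any syntax rule is applied.
        for pos in LEXICON.get(word.lower(), []):
            leaves.append(Node(pos, word, i, i + 1))
    return leaves

leaves = make_leaf_nodes("the part fits".split())
for leaf in leaves:
    print(leaf)
```

In this sketch the ambiguous word "part" yields two leaf nodes, one per part of speech, which is why the number of leaf nodes can exceed the number of words.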
A conventional parser attempts to apply syntax rules one-at-a-time to single nodes, to pairs of nodes, and, occasionally, to larger groups of nodes. If a syntax rule specifies that two certain types of nodes can be combined into a higher-level node and a pair of adjacent nodes match that specification, then the parser applies the rule to the adjacent nodes to create a higher-level node representing the syntactic construct of the rule.
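As a rough illustration of applying one rule to a pair of adjacent nodes, the following sketch assumes a hypothetical table of binary rules keyed by the types of the left and right nodes. The rule table, the `Node` record, and `apply_rule` are assumptions made for this example only.

```python
from typing import NamedTuple

class Node(NamedTuple):
    label: str            # node type, e.g. a part of speech or phrase type
    start: int            # index of the first word covered
    end: int              # index one past the last word covered
    children: tuple = ()  # lower-level nodes this node was built from

# Hypothetical binary syntax rules: (left type, right type) -> higher-level type.
RULES = {
    ("Det", "Noun"): "NP",
    ("NP", "VP"): "S",
}

def apply_rule(left, right):
    """Return a higher-level node if some rule matches the adjacent pair, else None."""
    if left.end != right.start:
        return None  # the nodes are not adjacent in the input span
    parent_label = RULES.get((left.label, right.label))
    if parent_label is None:
        return None  # no rule combines these two node types
    return Node(parent_label, left.start, right.end, (left, right))

det = Node("Det", 0, 1)
noun = Node("Noun", 1, 2)
np_node = apply_rule(det, noun)
print(np_node)
```

Rules over single nodes or over larger groups of nodes would need their own matching logic and are omitted here.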
A typical parser uses a node chart data structure to track the nodes that have been created. Each node is represented by a record that is stored in the node chart. A parser typically determines whether each syntax rule can be applied to the records currently in the node chart. If a rule succeeds, the parser creates a new record. Each record, thus, corresponds to a sub-tree that may potentially be part of the full-sentence syntax parse tree. When a record that encompasses all the words of the input sentence is promoted to the node chart, then the tree represented by the record is a full-sentence parse of the input sentence.
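A node chart of this kind might be sketched as follows. This is a simplified illustration, not a description of any particular parser: the `Record` type, the rule table, and `parse_with_chart` are hypothetical, records omit links to their child records for brevity, and only binary rules are considered.

```python
from typing import NamedTuple

class Record(NamedTuple):
    label: str  # node type
    start: int  # index of the first word covered
    end: int    # index one past the last word covered

# Hypothetical binary syntax rules: (left type, right type) -> higher-level type.
RULES = {
    ("Det", "Noun"): "NP",
    ("NP", "Verb"): "S",
}

def parse_with_chart(leaves, n_words, rules=RULES):
    """Apply rules to records in the chart until no new record can be created."""
    chart = set(leaves)
    changed = True
    while changed:
        changed = False
        for left in list(chart):
            for right in list(chart):
                if left.end != right.start:
                    continue  # only adjacent records can be combined
                parent = rules.get((left.label, right.label))
                if parent is None:
                    continue
                record = Record(parent, left.start, right.end)
                if record not in chart:
                    chart.add(record)  # the rule succeeded: store a new record
                    changed = True
    # A record spanning every word represents a full-sentence parse.
    full = [r for r in chart if r.start == 0 and r.end == n_words]
    return chart, full

# Leaf records for "the part fits", with both readings of "part" and "fits".
leaves = [
    Record("Det", 0, 1),
    Record("Noun", 1, 2), Record("Verb", 1, 2),
    Record("Verb", 2, 3), Record("Noun", 2, 3),
]
chart, full = parse_with_chart(leaves, 3)
print(sorted(chart, key=lambda r: (r.start, r.end, r.label)))
print(full)
```

Because this loop runs until no rule produces a new record, it corresponds to an exhaustive search rather than a guided one.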
The parser can conduct an exhaustive search for all possible full-sentence syntax parse trees by continuously applying the rules until no additional rules can be applied. The parser can also use various heuristic or statistical approaches to guide the application of syntax rules so that the rules that are most likely to result in a full-sentence syntax parse tree are applied first. Using such approaches, after one or a few full-sentence syntax parse trees are generated, the parser typically can terminate the search because the syntax parse tree most likely to be chosen as best representing the input is probably one of the first generated syntax parse trees. If no full-sentence syntax parse trees are generated after a reasonable search, then a fitted parse can be achieved by combining the most promising sub-trees together into a single tree using a root node that is generated by the application of a special aggregation rule.
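One possible way to assemble a fitted parse is sketched below. The text above does not specify how the most promising sub-trees are selected, so this sketch assumes, purely for illustration, that the sub-tree covering the most words from each position is preferred. The `Node` record, the `FITTED` root label, and `fitted_parse` are hypothetical.

```python
from typing import NamedTuple

class Node(NamedTuple):
    label: str
    start: int
    end: int
    children: tuple = ()

def fitted_parse(chart, n_words):
    """Join sub-trees left to right under one root (assumed aggregation rule)."""
    pieces = []
    pos = 0
    while pos < n_words:
        starting_here = [n for n in chart if n.start == pos]
        if not starting_here:
            pos += 1  # no node starts at this word; leave it uncovered
            continue
        # Assumption: the widest sub-tree is the most promising one.
        best = max(starting_here, key=lambda n: n.end)
        pieces.append(best)
        pos = best.end
    return Node("FITTED", 0, n_words, tuple(pieces))

# A chart in which no rule produced a node spanning all five words.
chart = [
    Node("Det", 0, 1), Node("Noun", 1, 2), Node("NP", 0, 2),
    Node("Prep", 2, 3),
    Node("Det", 3, 4), Node("Noun", 4, 5), Node("NP", 3, 5),
]
root = fitted_parse(chart, 5)
print([child.label for child in root.children])
```

A parser that scores its nodes could substitute that score for span width when choosing among overlapping sub-trees.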
Although such parsers can theoretically generate all possible syntax parse trees for an input sentence, they have the serious drawback that the complexity of the generated intermediate parse trees grows exponentially with the length of the input sentence being parsed. This exponential growth can quickly exceed memory and response time constraints for a particular application program that uses the parser. When memory or response time constraints have been exceeded, and parsing is stopped, the parser may have failed to produce a parse tree that spans all of the words in the input sentence. In particular, the parser may have failed to parse certain portions of the input. Thus, the resulting parse tree is completely uninformative as to those portions that were not parsed.
In one parser, a process is used to determine the likelihood that a certain syntax rule, when applied to a partial parse of an input span, will produce a node that will be part of the correct parse for the input span. This approach is used to guide the search through the space of possible parses toward those constructions that have the highest likelihood of producing the best parse, by producing a goodness measure for each node produced. This parser implements a "pruning" process to reduce the parsing time of the conventional parser. By integrating the goodness measure concept into the conventional parser, this parser reduces the time needed to parse the sentence by performing a non-exhaustive parse.
In conventional parsers, the mechanism by which nodes are added to the node chart involves a candidate list. A new node is produced by the application of a syntax rule to the nodes already in the node chart and is placed on the candidate list. In the pruning parser, the newly created node is assigned a goodness measure as it is placed on the candidate list. When it is time to promote a new node to the node chart, the candidate list is searched for the node with the highest goodness measure. That node is promoted to the node chart and used, along with neighboring nodes and the syntax rules, to generate additional nodes for the candidate list.
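The candidate-list mechanism with a goodness measure can be sketched with a priority queue. Everything specific in this example is an assumption: the numeric goodness values are invented, a new node simply takes the score attached to the rule that produced it, and the names `Node`, `RULES`, and `best_first_parse` are hypothetical. The text above does not state how the goodness measure is actually computed.

```python
import heapq
import itertools
from typing import NamedTuple

class Node(NamedTuple):
    label: str
    start: int
    end: int

# Hypothetical rules: (left type, right type) -> (new type, invented goodness).
RULES = {
    ("Det", "Noun"): ("NP", 0.9),
    ("Det", "Verb"): ("XP", 0.1),
    ("NP", "Verb"): ("S", 0.8),
}

def best_first_parse(scored_leaves, n_words, max_promotions=50):
    """Promote the best candidate to the chart until a full parse is promoted."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    candidates = []          # candidate list, kept as a max-heap via negated scores
    for leaf, goodness in scored_leaves:
        heapq.heappush(candidates, (-goodness, next(tie), leaf))
    chart = []
    while candidates and len(chart) < max_promotions:
        _, _, node = heapq.heappop(candidates)  # highest goodness measure
        if node in chart:
            continue
        chart.append(node)                      # promote to the node chart
        if node.start == 0 and node.end == n_words:
            return node, chart                  # first full-sentence parse
        # Combine the promoted node with neighboring nodes already in the chart.
        for other in chart:
            for left, right in ((node, other), (other, node)):
                if left.end != right.start:
                    continue
                rule = RULES.get((left.label, right.label))
                if rule is not None:
                    label, goodness = rule
                    new_node = Node(label, left.start, right.end)
                    heapq.heappush(candidates, (-goodness, next(tie), new_node))
    return None, chart

# Leaves for "the part fits" with invented goodness values.
scored_leaves = [
    (Node("Det", 0, 1), 1.0),
    (Node("Noun", 1, 2), 0.7), (Node("Verb", 1, 2), 0.3),
    (Node("Verb", 2, 3), 0.9), (Node("Noun", 2, 3), 0.2),
]
parse, chart = best_first_parse(scored_leaves, 3)
print(parse)
print(len(chart))
```

In this run the low-scoring leaf readings are never promoted, so the chart holds fewer nodes than an exhaustive search over the same input would produce.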
Two basic problems are faced by any node chart parser: generating a complete parse tree quickly, and generating the correct parse. The first problem arises from the time and resource constraints that any practical system must impose on the search for the correct parse. Any parser could simply generate all possible parses and then use a goodness measure to choose the best parse, but this approach will consume an impractical amount of memory for the node chart and an impractical amount of time for the parse. Thus, a non-exhaustive search of the space of possible parses for a sentence is needed, preferably one that creates as few nodes as possible.
The other problem consists of actually generating and identifying the correct parse. Even given exhaustive parsing, there is no guaranteed way of identifying the correct parse. With pruning of the search space (i.e., implementing shortcuts), most possible parses are never even generated. Conventional parsers guide the search generally in the direction of the correct parse, but the first parse found is not always the correct parse. Due to time and resource constraints, parsing continues only until a limited number of additional nodes have been added to the node chart after the first parse is found. If the first parse is not the correct parse, there is no guarantee that the correct parse will be found within the limited amount of additional searching.
Therefore, there is a need for an improved natural language parser that searches the space of possible parses and maximizes the probability of finding the correct parse, and does so quickly by building as few nodes as possible.