1. Field of the Invention
The present invention relates generally to natural language processing and, more particularly, to a parser and a method for parsing an input string.
2. Description of Related Art
Parsers and parsing methods perform an automatic syntax analysis of an input string in order to assign structural descriptions to the input string. A parser analyses input strings based on a predefined set of principles and rules with the aim of establishing a rule-compliant (correct) sentence structure.
The inner structure of a sentence may be defined using different grammar formalisms. A grammar formalism defines the inner structure of a language using rules and principles. Examples of grammar formalisms are GPSG (generalized phrase structure grammar) and HPSG (head-driven phrase structure grammar). A grammar formalism like GPSG describes the syntax structure of a natural language using rules and feature constraints.
A syntax structure of an input string may be represented in the form of a tree. Constituents within such a tree may be a noun phrase (NP), a verbal phrase (VP), a prepositional phrase (PP), and an adjective phrase (AdjP). Constituents may represent elements or larger units of elements of an input string. An example of a noun phrase (NP) is:
Det+N
and an example of a verbal phrase (VP) is:
Verb+NP+PP.
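The tree representation described above may be illustrated by the following minimal Python sketch. The class name Node and the method bracketed are hypothetical names chosen for illustration only; the PP constituent is left unexpanded because its inner structure is not given above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One constituent (or lexical category) in a syntax tree."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def bracketed(self) -> str:
        """Render the subtree in labelled-bracket notation."""
        if not self.children:
            return self.label
        inner = " ".join(child.bracketed() for child in self.children)
        return "[" + self.label + " " + inner + "]"


# NP = Det+N, VP = Verb+NP+PP, as in the examples above.
noun_phrase = Node("NP", [Node("Det"), Node("N")])
verbal_phrase = Node("VP", [Node("Verb"), noun_phrase, Node("PP")])

print(verbal_phrase.bracketed())  # [VP Verb [NP Det N] PP]
```

The nesting of Node objects mirrors the way a larger constituent (VP) contains smaller ones (NP) as units.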
A parser performs a syntax analysis of an input string based on a set of predefined rules. A rule replaces consecutive elements or constituents of an input string by another constituent. A well-known type of rule is the sequence rule. A sequence rule has the form:
A→B1 . . . Bn
The elements to be replaced are represented by Bi (i = 1 . . . n), and the new symbol on a higher level is represented by A. The consecutive elements B1 . . . Bn are replaced by A only when the elements Bi are present in the predefined order 1 . . . n.
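The behaviour of a sequence rule may be illustrated by the following minimal Python sketch. The representation of a rule as a pair (A, (B1, ..., Bn)) and the function name apply_sequence_rule are assumptions made for illustration; the sketch rewrites only the first matching occurrence and assumes n is at least 1.

```python
from typing import List, Tuple

# Hypothetical representation of A -> B1 ... Bn as (A, (B1, ..., Bn)).
Rule = Tuple[str, Tuple[str, ...]]


def apply_sequence_rule(symbols: List[str], rule: Rule) -> List[str]:
    """Replace the first run of symbols equal to B1 ... Bn by A.

    The run must be consecutive and in exactly the order given by the
    rule; otherwise the input is returned unchanged (as a copy).
    """
    lhs, rhs = rule
    n = len(rhs)
    for i in range(len(symbols) - n + 1):
        if tuple(symbols[i:i + n]) == rhs:
            return symbols[:i] + [lhs] + symbols[i + n:]
    return list(symbols)


NP_RULE: Rule = ("NP", ("Det", "N"))

print(apply_sequence_rule(["Det", "N", "Verb"], NP_RULE))  # ['NP', 'Verb']
print(apply_sequence_rule(["N", "Det", "Verb"], NP_RULE))  # ['N', 'Det', 'Verb']
```

The second call leaves the input unchanged because the elements are present but not in the predefined order.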
Sequence rules cannot express generalizations that are only implicitly contained in them. Newer rules, introduced by grammar formalisms like GPSG, distinguish between dominance relations and precedence relations. Immediate dominance rules (ID rules) define only the dominance relation between a higher level unit and its lower level elements; a dominance rule describes the lower level elements in any order. A particular order of the lower level elements is defined by linear precedence rules (LP rules). The use of ID/LP rules may replace the plurality of sequence rules that would otherwise be necessary to describe the same structure.
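The relationship between ID/LP rules and sequence rules may be illustrated by the following Python sketch, which enumerates the sequence rules licensed by a single dominance rule under a set of precedence constraints. The function name expand_id_lp and the encoding of a precedence constraint as a pair (a, b), meaning that a must precede b, are assumptions made for illustration. The sketch further assumes that the lower level elements of the dominance rule are pairwise distinct.

```python
from itertools import permutations
from typing import Iterable, List, Sequence, Tuple

SequenceRule = Tuple[str, Tuple[str, ...]]


def expand_id_lp(
    lhs: str,
    daughters: Sequence[str],
    lp: Iterable[Tuple[str, str]],
) -> List[SequenceRule]:
    """List every ordering of the daughters that satisfies all LP pairs."""
    constraints = list(lp)
    rules: List[SequenceRule] = []
    for order in sorted(permutations(daughters)):
        position = {symbol: index for index, symbol in enumerate(order)}
        if all(
            position[a] < position[b]
            for a, b in constraints
            if a in position and b in position
        ):
            rules.append((lhs, order))
    return rules


# ID rule: VP dominates Verb, NP and PP in any order.
# LP rules: Verb precedes NP, and Verb precedes PP.
for rule in expand_id_lp("VP", ["Verb", "NP", "PP"], [("Verb", "NP"), ("Verb", "PP")]):
    print(rule)
# ('VP', ('Verb', 'NP', 'PP'))
# ('VP', ('Verb', 'PP', 'NP'))
```

Without any precedence constraint the same dominance rule corresponds to six sequence rules, which shows how one ID rule together with a few LP rules can stand in for several sequence rules.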
The syntax analysis of a parser is generally based on one of three different strategies, namely using a feature-value grammar, an LR grammar, or a finite-state transducer. The most common strategy is based on feature-value grammars, especially with chart parsers. Such grammars are usually stored in one single space. The parser has to find the best rule to apply to a sequence of categories of an input string. It is a drawback of feature-value grammars that the employed strategies are very complex and require a huge computational effort. On the other hand, these grammars provide good readability of the formalism describing grammatical events and have the ability to process rich lexical information.
LR grammars are another commonly used strategy. This strategy is based on a stack algorithm combined with decision tables. The rules of this grammar utilize specific operators for combining or skipping sequences of categories. These grammars usually result in a faster and more economical automatic syntax analysis. On the other hand, these grammars are usually difficult to read and to maintain. In addition, the scope of these grammars is usually limited to categories that are next to each other in a sequence, and long-range dependencies are difficult to implement.
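The combination of a stack and decision tables may be illustrated by the following deliberately small Python sketch. It is not taken from any particular LR parser: the tables ACTION and GOTO are written by hand for the single rule NP → Det N, the end marker "$" and the function name lr_parse are assumptions, and a practical system would generate much larger tables from its grammar.

```python
from typing import Dict, List, Optional, Tuple

# Hand-written decision tables for the toy grammar NP -> Det N.
# Keys are (state on top of the stack, next category).
ACTION: Dict[Tuple[int, str], Tuple[str, object]] = {
    (0, "Det"): ("shift", 1),
    (1, "N"): ("shift", 2),
    (2, "$"): ("reduce", ("NP", 2)),  # pop 2 states, then build NP
    (3, "$"): ("accept", None),
}
GOTO: Dict[Tuple[int, str], int] = {(0, "NP"): 3}


def lr_parse(categories: List[str]) -> Optional[List[str]]:
    """Return the constituents built, in order, or None on a syntax error."""
    stack = [0]                       # stack of parser states
    buffer = list(categories) + ["$"]
    position = 0
    built: List[str] = []
    while True:
        action = ACTION.get((stack[-1], buffer[position]))
        if action is None:
            return None               # no table entry: input is rejected
        kind, argument = action
        if kind == "shift":
            stack.append(argument)
            position += 1
        elif kind == "reduce":
            lhs, length = argument
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])
            built.append(lhs)
        else:                         # accept
            return built


print(lr_parse(["Det", "N"]))  # ['NP']
print(lr_parse(["N", "Det"]))  # None
```

Every decision is a single table lookup on the current state and the next category, which suggests why this strategy is fast, and also why the tables themselves are hard to read and why only adjacent categories are considered.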
The third strategy is based on finite-state transducers, which are combined in order to describe certain sequences of categories. Each transducer applies to a sequence of categories in a predefined order. Although this approach allows fast and robust parsers, it is difficult to exploit fine-grained lexical information and to express complex phenomena in a readable way.
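The idea of applying several transducers in a predefined order may be illustrated by the following Python sketch. It is a simplification: regular-expression substitutions stand in for compiled finite-state transducers, and the stages, the category Prep, and the function name run_cascade are assumptions made for illustration.

```python
import re
from typing import List, Pattern, Tuple

# Each stage rewrites one pattern of categories into a single constituent.
# The stages run in exactly this order.
CASCADE: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"\bDet N\b"), "NP"),
    (re.compile(r"\bPrep NP\b"), "PP"),
    (re.compile(r"\bVerb NP PP\b"), "VP"),
]


def run_cascade(categories: List[str]) -> List[str]:
    """Apply every stage once, in order, to a sequence of categories."""
    text = " ".join(categories)
    for pattern, label in CASCADE:
        text = pattern.sub(label, text)
    return text.split()


print(run_cascade(["Verb", "Det", "N", "Prep", "Det", "N"]))  # ['VP']
```

The order of the stages matters: the PP stage can only succeed because the NP stage has already run, and a stage that comes too early simply finds nothing to rewrite.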