A natural language parser is a program that takes as input a segment of natural language (i.e., human language, such as English) text, usually a sentence, and produces as output a data structure for that segment, usually referred to as a parse tree. This parse tree typically represents the syntactic relationships between the words in the input segment. The parse tree may also represent certain semantic relationships.
Natural language parsers have traditionally been "rule-based." Such rule-based parsers store knowledge about the syntactic structure of a language in the form of linguistic rules, and apply these rules to the input text segment in order to obtain the resulting parse tree. The parser usually stores information about individual words, such as which parts of speech they can represent, in a dictionary or "lexicon," which the parser consults for each word in the input text before applying the linguistic rules.
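The lexicon-then-rules flow described above can be illustrated with a minimal sketch. The three-word lexicon, the two binary rules, and the chart-based search below are assumptions chosen for illustration only; they are not taken from any particular rule-based parser.

```python
# Hypothetical toy lexicon: each word maps to the parts of speech it can represent.
LEXICON = {
    "I": ["pronoun"],
    "perform": ["verb"],
    "parses": ["noun", "verb"],
}

# Hypothetical toy rules: a pair of adjacent constituent labels reduces to a new label.
RULES = {
    ("verb", "noun"): "verb_phrase",
    ("pronoun", "verb_phrase"): "sentence",
}


def parse(words):
    """Return every tree that spans all of `words`, as nested tuples."""
    n = len(words)
    if n == 0:
        return []
    chart = {}
    # Step 1: consult the lexicon for each word before any rule is applied.
    for i, word in enumerate(words):
        chart[(i, i + 1)] = [(pos, word) for pos in LEXICON.get(word, [])]
    # Step 2: apply the rules bottom-up over progressively longer spans.
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            cell = []
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        label = RULES.get((left[0], right[0]))
                        if label is not None:
                            cell.append((label, left, right))
            chart[(start, end)] = cell
    return chart[(0, n)]


trees = parse("I perform parses".split())
```

Even in this toy form, every new word or construction requires a hand-written lexicon entry or rule, which is the labor cost that the following paragraph describes.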
Such rule-based parsers have the disadvantage that they require extensive amounts of both dictionary data and rule-writing labor by highly skilled linguists to create, enhance, and maintain. This is especially true if the parser is to have "broad coverage," that is, if it is to be able to parse "general" natural language text of many different types.
Recently, there has been increasing activity focused on using statistical methods to acquire information from large, annotated corpora of natural language text, and on using that information in statistical natural language parsers. Instead of being stored in the traditional form of dictionary data and linguistic rules, this information is represented as statistical parameters, or probabilities. These probabilities are usually then used in parsers together with simpler dictionary data and rules, thereby taking the place of much of the information created by skilled labor in rule-based systems.
Such a statistical parser is initially incapable of parsing "raw" input text. The statistical parser is first operated in a training mode, in which it receives input strings that have been annotated by a linguist with tags that specify parts of speech, syntactic function, etc. The statistical parser records statistics reflecting the application of the tags to portions of the input string. As an example, an input string such as the following might be used: I perform parses.
A linguist would then add tags to the input string to produce the following tagged input string: sentence(pronoun("I") verb_phrase(verb("perform") noun("parses")))
When the above tagged input string is submitted to the statistical parser in its training mode, it would adjust its statistics to the effect that each of the following parsing steps is more likely to be successful: "perform" → verb, "parses" → noun, verb noun → verb_phrase, "I" → pronoun, pronoun verb_phrase → sentence. After a significant amount of training using tagged input strings, the statistical parser enters a parsing mode, in which it receives raw, untagged input strings. In the parsing mode, the statistical parser applies the statistics assembled in the training mode in order to attempt to build a parse tree for the untagged input string.
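As a rough sketch of the training mode, the tagged input string above can be represented as a nested tuple and walked to count each parsing step. The tuple encoding, the function names, and the relative-frequency estimate are assumptions made for illustration; an actual statistical parser may record and smooth its statistics quite differently.

```python
from collections import Counter

# Hypothetical encoding of the tagged string: (label, word) for a tagged word,
# (label, child, child, ...) for a phrase.
TAGGED = (
    "sentence",
    ("pronoun", "I"),
    ("verb_phrase", ("verb", "perform"), ("noun", "parses")),
)


def count_steps(tree, counts):
    """Add one count for every parsing step used to build `tree`."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Lexical step, e.g. "perform" -> verb; keyed by the word itself.
        counts[(children[0], label)] += 1
        return
    # Phrasal step, e.g. verb noun -> verb_phrase; keyed by a tuple of labels.
    rhs = tuple(child[0] for child in children)
    counts[(rhs, label)] += 1
    for child in children:
        count_steps(child, counts)


def step_probability(counts, rhs, label):
    """Relative frequency with which `rhs` was tagged as `label` in training."""
    total = sum(c for (r, _), c in counts.items() if r == rhs)
    return counts[(rhs, label)] / total if total else 0.0


counts = Counter()
count_steps(TAGGED, counts)
# A second, hypothetical training example in which "parses" is tagged as a verb.
count_steps(("sentence", ("pronoun", "I"), ("verb", "parses")), counts)
```

After these two examples, the counts record "parses" once as a noun and once as a verb, so in parsing mode neither reading would be preferred on lexical evidence alone; more tagged input shifts the estimates.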
The advantages of statistical parsers over rule-based parsers are that they decrease the amount of rule-writing labor required to create a high-quality parser and that they allow a parser to be "tuned" to a particular type of text simply by extracting statistical information from text of the same type. The disadvantage of this approach is that it requires a large body, or corpus, of natural language text that has been laboriously tagged.
There has been some discussion and work in the area of creating hybrid natural language processing systems that make use of both traditional rules and data as well as statistical methods for acquiring the linguistic knowledge required. According to a first hybrid approach, statistical methods are either (1) applied to a large tagged corpus or (2) used to model the linguistic accuracy of a parse structure as determined by human interaction. In the first case, the information obtained is used in a separate preprocessing step to select the parts-of-speech for words before parsing with a rule-based parser. In the second case, the information is used to determine the most likely syntactic parse or semantic interpretation after a rule-based parser has produced multiple alternatives. In neither case is the information actually applied during operation of the parser.
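The two cases of this first hybrid approach can be sketched as steps that sit before and after a rule-based parser, without touching the parser itself. The most-frequent-tag heuristic, the fallback tag, and the function names below are hypothetical choices for illustration; the scoring function in the second case is left to the caller because the text does not specify one.

```python
from collections import Counter, defaultdict


def train_tagger(tagged_corpus):
    """Count how often each word carries each tag in a tagged corpus."""
    freq = defaultdict(Counter)
    for word, tag in tagged_corpus:
        freq[word][tag] += 1
    return freq


def pretag(words, freq, default="noun"):
    """Case 1: choose one part of speech per word before rule-based parsing."""
    return [
        (w, freq[w].most_common(1)[0][0] if w in freq else default)
        for w in words
    ]


def pick_best(alternatives, score):
    """Case 2: choose among parses the rule-based parser has already produced."""
    return max(alternatives, key=score)


freq = train_tagger([
    ("parses", "noun"), ("parses", "noun"), ("parses", "verb"),
    ("perform", "verb"),
])
tagged = pretag(["perform", "parses"], freq)
```

In both functions the statistics act only on the parser's input or output, which mirrors the observation that the information is not applied during operation of the parser.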
In a second approach, a rule-based parser is not used at all, but rather, traditional linguistic knowledge is used to determine, for example, the possible parts-of-speech for words, thus allowing words in untagged corpora to be grouped according to their possible parts-of-speech. Statistical methods are then applied over these groups, rather than over the words themselves, in order to obtain higher-level bigram and trigram language models that approximate the syntactic structure of each input string and that will be used later by a statistics-based parser. While these language models are indeed representative at some level of the input strings from which they were derived, they are still generally not as structurally rich and descriptive as the parse trees obtained by rule-based parsers.
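A minimal sketch of this second approach follows: words from untagged text are mapped to groups defined by their possible parts of speech, and bigram counts are collected over the groups rather than over the words. The word list, the boundary markers, and the unsmoothed estimate are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical dictionary of the possible parts of speech for each word.
POSSIBLE_POS = {
    "I": ("pronoun",),
    "perform": ("verb",),
    "parses": ("noun", "verb"),
}

START, END = ("<s>",), ("</s>",)


def group_of(word):
    """Group a word by the set of parts of speech it could represent."""
    return POSSIBLE_POS.get(word, ("unknown",))


def train_bigrams(sentences):
    """Count group bigrams, and group contexts, over untagged sentences."""
    bigrams, contexts = Counter(), Counter()
    for words in sentences:
        groups = [START] + [group_of(w) for w in words] + [END]
        for prev, cur in zip(groups, groups[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts


def bigram_probability(bigrams, contexts, prev, cur):
    """Unsmoothed estimate of P(cur | prev) over groups."""
    return bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0


bigrams, contexts = train_bigrams([
    ["I", "perform", "parses"],
    ["I", "perform"],
])
p = bigram_probability(bigrams, contexts, ("verb",), ("noun", "verb"))
```

The resulting model is a flat table of adjacent-group probabilities, which illustrates why such models carry less structure than a parse tree.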