For a computer to correctly interpret sentences in a natural human language, e.g., English, French, or Japanese, it must determine how the words in the sentences interact. This is done by representing the way sentences are constructed in the language as a set of grammar rules and using a parser that utilizes the grammar rules to derive a so-called parse tree that specifies the syntactic interaction between the words in a sentence. Once the interaction between the words in a sentence is known, the precise meaning of the sentence can be ascertained.
A grammar for a human language contains many thousands of rules; however, only a handful of rules are applicable to any one sentence. To analyze a sentence, a parser has to compare each rule in the grammar with the sentence in order to determine which rules are applicable to the sentence and exactly where in the sentence each rule applies. The result of this analysis is represented as a parse tree, which shows which rules apply to each part of the sentence.
Because grammars contain many rules and there are many places in a given sentence where a rule might apply, parsing is a time-consuming process. In particular, current parsers are so slow that they place severe limits on the ability of computers to interpret human language.
Context-free grammars have been generally accepted for many years as the basis for parsing sentences in human languages. As discussed in standard textbooks on parsing, such as "Theory of Parsing, Translation and Compiling. Vol 1: Parsing", A. V. Aho and J. D. Ullman, Prentice-Hall 1973, the fastest context-free grammar parsers require a length of time k*n^3, where k is a large constant factor and n^3 is the cube of the number of words in the sentence being parsed.
It has been thought that parsing time can be dramatically decreased by converting a context-free grammar into a lexicalized form where every rule is directly linked to a word. The value of this is that a parser operating on a lexicalized grammar does not have to consider every rule in the grammar, but only the rules that are linked to the words in the actual sentence being parsed. This filtering of the grammar dramatically reduces the constant factor, k, and thus the time required to parse a sentence.
Heretofore, the most efficient lexicalized form for context-free grammar is lexicalized tree adjoining grammar. Unfortunately, as shown in "Tree-Adjoining Grammars and Lexicalized Grammars" by A. K. Joshi and Y. Schabes, in "Tree Automata and Languages", M. Nivat and A. Podelski, editors, Elsevier, 1992, the fastest parsers for lexicalized tree adjoining grammar require a length of time k*n^6. Even though the constant factor, k, is reduced, the increase in the exponent means that for typical sentences, parsing using lexicalized tree adjoining grammar takes thousands of times longer than parsing with context-free grammar. As a result, far from decreasing parsing time, conversion to a lexicalized form has heretofore only increased parsing time.
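The effect of the larger exponent can be checked with simple arithmetic. The following Python sketch compares the two time estimates; the function name and the values chosen for the constant factors are illustrative assumptions only, since no numerical values for k are given here.

```python
def time_ratio(n, k_cf=1.0, k_ltag=1.0):
    """Ratio of the lexicalized tree adjoining grammar estimate (k_ltag * n**6)
    to the context-free estimate (k_cf * n**3) for a sentence of n words.

    k_cf and k_ltag are hypothetical constant factors, not measured values.
    """
    return (k_ltag * n**6) / (k_cf * n**3)


# With equal constant factors the ratio is simply n**3.
for n in (10, 20, 30):
    print(n, time_ratio(n))
# 10 1000.0
# 20 8000.0
# 30 27000.0

# Even if lexicalization cut the constant factor a hundredfold (an assumed
# figure), a 20-word sentence would still take 80 times longer.
print(time_ratio(20, k_cf=100.0, k_ltag=1.0))  # 80.0
```

Under these assumptions, the ratio for sentences of ten to thirty words lies in the thousands unless the constant factor is reduced by a comparable amount.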
More specifically, the rules in a grammar for a language specify how different types of words, e.g., nouns, verbs, and adjectives, are grouped into phrases, such as noun phrases and verb phrases, and how phrases are grouped into sentences. A rule in a context-free grammar specifies how one or more words or phrases can be grouped into a phrase or sentence. The process of using a grammar to determine how the words in a sentence have been combined is called `parsing`. A device that does parsing is called a `parser`. The output created by a parser is called a `parse tree`.
For example, consider the English sentence "The farmer loves the beach." Three key rules of English grammar are relevant to this sentence. First, a noun phrase can consist of a determiner, such as `the` or `a`, followed by a noun, such as `farmer` or `house`. Second, a verb phrase can consist of a verb, such as `loves` or `buys`, followed by a noun phrase. Third, a sentence can consist of a noun phrase followed by a verb phrase.
When given the above sentence and grammar rules, a parser creates a parse tree that specifies the structure of the sentence as follows: The sentence as a whole is composed of the noun phrase "the farmer" followed by the verb phrase "loves the beach". The noun phrase "the farmer" is composed of the determiner `the` followed by the noun `farmer`. The verb phrase "loves the beach" is composed of the verb `loves` followed by the noun phrase "the beach". The noun phrase "the beach" is composed of the determiner `the` followed by the noun `beach`.
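The three rules and the resulting parse tree can be reproduced with a small chart parser. The Python sketch below is illustrative only: the category names (Det, N, V, NP, VP, S), the dictionary layout, and the function names are assumptions made for this example, and the chart-filling strategy is one conventional way to parse with context-free rules, not a description of any particular parser.

```python
# Word categories for the example words (layout is an assumption).
LEXICON = {
    "the": "Det", "a": "Det",
    "farmer": "N", "house": "N", "beach": "N",
    "loves": "V", "buys": "V",
}

# The three rules from the example: (left part, right part) -> phrase type.
RULES = {
    ("Det", "N"): "NP",  # a noun phrase is a determiner followed by a noun
    ("V", "NP"): "VP",   # a verb phrase is a verb followed by a noun phrase
    ("NP", "VP"): "S",   # a sentence is a noun phrase followed by a verb phrase
}


def tokenize(sentence):
    """Lower-case the sentence, drop a trailing period, split on spaces."""
    return sentence.lower().rstrip(".").split()


def parse(sentence):
    """Return a parse tree as nested tuples, or None if no rule sequence fits."""
    words = tokenize(sentence)
    n = len(words)
    if n == 0:
        return None
    # chart[(i, j)] maps a category to one tree covering words[i:j].
    chart = {}
    for i, word in enumerate(words):
        cat = LEXICON.get(word)
        chart[(i, i + 1)] = {cat: (cat, word)} if cat else {}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            # Only adjacent pieces words[i:k] and words[k:j] are combined.
            for k in range(i + 1, j):
                for lcat, ltree in chart[(i, k)].items():
                    for rcat, rtree in chart[(k, j)].items():
                        parent = RULES.get((lcat, rcat))
                        if parent is not None and parent not in cell:
                            cell[parent] = (parent, ltree, rtree)
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")


print(parse("The farmer loves the beach."))
# ('S', ('NP', ('Det', 'the'), ('N', 'farmer')),
#       ('VP', ('V', 'loves'), ('NP', ('Det', 'the'), ('N', 'beach'))))
```

The nested tuples mirror the description above: the outermost element is the sentence, and its two parts are the noun phrase "the farmer" and the verb phrase "loves the beach".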
A grammar is said to be lexicalized if every rule in the grammar contains a specific word. The above grammar is not lexicalized because the third rule does not refer to words, but just to general types of phrases. Lexicalized grammars speed parsing, because the parsing process need only consider the rules that contain the words that are actually in the sentence being parsed.
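The filtering that lexicalization makes possible can be sketched in a few lines of Python. The rules listed below are hypothetical and serve only to show each rule being linked to a word; they are not asserted to produce the same parse trees as the three rules of the example above, and the function names are likewise assumptions.

```python
# Hypothetical lexicalized rules, each paired with the word it contains.
LEXICALIZED_RULES = [
    ("the", "NP -> the N"),
    ("a", "NP -> a N"),
    ("loves", "S -> NP loves NP"),
    ("buys", "S -> NP buys NP"),
    ("farmer", "N -> farmer"),
    ("house", "N -> house"),
    ("beach", "N -> beach"),
]


def build_index(rules):
    """Group the rules by the word they contain."""
    index = {}
    for word, rule in rules:
        index.setdefault(word, []).append(rule)
    return index


def applicable_rules(index, words):
    """Return only the rules linked to words that occur in the sentence."""
    selected = []
    for word in dict.fromkeys(words):  # each distinct word once, in order
        selected.extend(index.get(word, []))
    return selected


index = build_index(LEXICALIZED_RULES)
words = "the farmer loves the beach".split()
print(applicable_rules(index, words))
# ['NP -> the N', 'N -> farmer', 'S -> NP loves NP', 'N -> beach']
```

In this toy case four of the seven rules survive the filter; in a grammar with thousands of rules, the fraction considered for any one sentence would be far smaller.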
To convert a context-free grammar into a grammar that is lexicalized and yet leads to the same parse trees, one must use rules that are more complex than the rules used in context-free grammar. The best way to accomplish this to date is to use adjoining rules. By an adjoining rule is meant a rule that adds a word into the middle of a phrase. For example, in English a noun phrase, such as "the farmer", can be extended by adding an adjective, such as `wise`, before the noun, yielding "the wise farmer". This is a simple case of adjoining in which a word is added on only one side, here the left, of the subphrase. Rules of this type are called non-wrapping adjoining rules.
In more complex situations, adjoining can add words both before and after a subphrase. This is called wrapping adjoining, because the subphrase is `wrapped` with added words on both sides. For example, there could be an adjoining rule that places words both before and after `farmer` in the phrase above. A key observation is that wrapping adjoining rules add two words or groups of words that do not end up next to each other in the phrase, because the subphrase is in between. This is in contrast to context-free rules in which all the added words are always next to each other.
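The difference between the two kinds of adjoining can be seen in their effect on the word sequence. The Python sketch below is a deliberate simplification: adjoining rules operate on phrase structure, whereas this example works on flat word lists merely to show where the added words end up. The function names are assumptions, and LEFT and RIGHT are placeholders standing in for whatever words a wrapping rule might add.

```python
def adjoin(phrase, sub_start, sub_end, left=(), right=()):
    """Insert `left` before and `right` after the subphrase
    phrase[sub_start:sub_end], returning a new word list."""
    return (list(phrase[:sub_start]) + list(left)
            + list(phrase[sub_start:sub_end])
            + list(right) + list(phrase[sub_end:]))


def is_wrapping(left, right):
    """Adjoining is wrapping when words are added on both sides."""
    return bool(left) and bool(right)


noun_phrase = ["the", "farmer"]

# Non-wrapping adjoining: `wise` is added only to the left of `farmer`.
print(adjoin(noun_phrase, 1, 2, left=["wise"]))
# ['the', 'wise', 'farmer']

# Wrapping adjoining: placeholder words are added on both sides of `farmer`.
print(adjoin(noun_phrase, 1, 2, left=["LEFT"], right=["RIGHT"]))
# ['the', 'LEFT', 'farmer', 'RIGHT']
```

In the wrapping case the two added items are separated by `farmer`, which is the property that distinguishes such rules from context-free rules.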
The key difference between lexicalized tree adjoining grammar and context-free grammar is that lexicalized tree adjoining grammar allows adjoining rules and requires every rule to contain a specific word. Because lexicalized tree adjoining grammars are lexicalized, they can be dynamically filtered during parsing by considering only those rules that contain the words that are in the sentence being parsed. This has the potential of reducing parsing time, because it reduces the constant factor k.
However, because it allows wrapping adjoining rules, lexicalized tree adjoining grammar is much more time-consuming to parse with than context-free grammar. Since the items combined by a context-free rule must be next to each other, a parser only has to consider items that are next to each other when considering whether a context-free rule applies. However, since a wrapping adjoining rule can add items that are not next to each other, a parser must also consider pairs of items that are not next to each other when considering whether an adjoining rule applies. This adds a great many more situations that have to be considered, and the parser is forced to operate in time k*n^6. While k is reduced by using lexicalized tree adjoining grammar, the parsing time is dramatically increased, because the n^3 time factor of context-free grammars goes to n^6.
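The growth in the number of situations can be illustrated by counting candidate pairs of items in a sentence of n words. The Python sketch below is a rough illustration under stated assumptions: an item is modeled simply as a contiguous stretch of words between two positions, and the function names are hypothetical. It does not derive the k*n^6 figure; it only shows that dropping the requirement that items be next to each other enlarges the set of pairs to be examined.

```python
from itertools import combinations


def adjacent_pairs(n):
    """Count pairs of items words[i:k] and words[k:j] that touch at k,
    i.e. position triples i < k < j over the positions 0..n."""
    return sum(1 for _ in combinations(range(n + 1), 3))


def separated_pairs(n):
    """Count pairs of items words[i:j] and words[k:l] with a gap between
    them, i.e. position quadruples i < j < k < l over the positions 0..n."""
    return sum(1 for _ in combinations(range(n + 1), 4))


for n in (10, 20):
    print(n, adjacent_pairs(n), separated_pairs(n))
# 10 165 330
# 20 1330 5985
```

For a 20-word sentence, the separated pairs outnumber the adjacent pairs by more than four to one under this simple model, and the gap widens as the sentence grows.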