1. Technical Field
This invention relates to the field of natural language understanding, and more particularly, to including grammars within a statistical parser.
2. Description of the Related Art
Natural language understanding (NLU) systems enable computers to understand and extract information from human written or spoken language. Such systems can function in a complementary manner with a variety of other computer applications where there exists a need to understand human language. NLU systems can extract relevant information contained within text and then supply this information to another application program or system for purposes such as booking flight reservations, finding documents, or summarizing text.
Currently within the art, NLU systems employ one of two different methods for extracting information from text strings, where a text string refers to a single sentence or other grouping of words. The first method is a linguistic approach to parsing text strings. The most common linguistic approach to NLU makes use of only a context free grammar, commonly represented within the art using Backus-Naur Form (BNF) comprising terminals and non-terminals. Terminals refer to words or other symbols which cannot be broken down any further, whereas non-terminals refer to parts of speech or phrases such as a verb phrase or a noun phrase. Thus, the grammatical approach to NLU seeks to parse each text string based on BNF grammars without the use of statistical processing. Potential ambiguities within text strings, where a terminal can be construed as belonging to more than one non-terminal, must be resolved within the grammar. For example, the NLU system can group a tag for a terminal at either the end of a previous non-terminal, or alternatively, as the start of another non-terminal.
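By way of illustration only, the grammatical approach described above can be sketched in Python. The grammar, the symbol names (S, PREFIX, DATE, MONTH, DAY, YEAR), and the helper functions below are hypothetical and are not taken from any particular NLU system; the sketch merely shows a BNF-style rule set of terminals and non-terminals accepting or rejecting a text string without any statistical processing.

```python
# Hypothetical toy grammar in a BNF-like form. Each non-terminal maps to a
# list of alternative right-hand sides; any symbol that is not a key of the
# dictionary is treated as a terminal (a literal word).
GRAMMAR = {
    "S": [["PREFIX", "fly", "on", "DATE"]],
    "PREFIX": [["i", "want", "to"], ["i", "would", "like", "to"]],
    "DATE": [["MONTH", "DAY"], ["MONTH", "DAY", "YEAR"]],
    "MONTH": [["december"], ["january"]],
    "DAY": [["4"], ["5"]],
    "YEAR": [["2000"], ["2001"]],
}


def match(symbols, tokens):
    """Return True if the symbol sequence derives exactly the token list."""
    if not symbols:
        return not tokens
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        # Terminal: it must equal the next token.
        return bool(tokens) and tokens[0] == head and match(rest, tokens[1:])
    # Non-terminal: try each alternative expansion in turn.
    return any(match(rhs + rest, tokens) for rhs in GRAMMAR[head])


def accepts(sentence):
    """Return True if the grammar generates the whitespace-split sentence."""
    return match(["S"], sentence.lower().split())


print(accepts("I want to fly on December 4 2000"))
print(accepts("I want to fly on 4 December 2000"))
```

The second string is rejected only because no rule for that date syntax was written, which reflects the requirement that every accepted text string be built into the grammar beforehand.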
Accordingly, to understand a text string, the NLU system requires the text string to be a priori built into the BNF grammar. In other words, the NLU system requires that a BNF grammar be written that contains rules which generate the text string. The disadvantage of using a purely grammatical approach is the large amount of linguistic expertise needed to write sufficient, yet unambiguous, BNF grammars. Moreover, to resolve ambiguities, the system must examine the multiple parse trees ambiguously produced by the grammar and use auxiliary information to select the correct parse. Consequently, building a grammar-based NLU system can be time consuming and inefficient, and can further require experts.
Accommodating the many possible BNFs for text strings can become further complicated because particular phrases such as dates can be expressed in a variety of differing syntaxes. For example, the strings “Dec. 4, 2000” and “4th of December 2000” represent the same date. The grammar must not only contain a grammatical representation of the text string, but also contain additional rules for each permutation of possible dates within the text string. For example, taking the text string “I want to fly on Monday, Dec. 4, 2000”, many variations can be obtained by inserting a different date or by using a different date syntax. Thus, an ambiguity can arise. One example of such an ambiguity can be the NLU system interpreting the “2000” as both a time of day expressed in military time, and as the year of a date. The NLU system can never avoid all ambiguities with the proper BNF. For example, the “2000” might always be determined to be a year if immediately following a month and a day in a sentence, even though the BNF proposed that this could be a separate date and time. Notably, the grammars become increasingly complex as sentence structure becomes more intricate. Accounting for each possible permutation of well-formed, that is, grammatically and syntactically correct, date expressions further contributes to the problem. Similar ambiguities and situations also can arise with regard to times, prices, dollar amounts, percentages, and prefix expressions such as “may I please have” or “can you give me”.
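The “2000” ambiguity described above can be made concrete with a short Python sketch. The rule set and the names REQUEST, DATE, TIME, parses, and full_parses are assumptions introduced purely for illustration. The sketch enumerates every parse tree that a small grammar produces for a token sequence, so that a string in which “2000” can be either a year or a military time yields two trees.

```python
# Hypothetical grammar in which "2000" is both a YEAR and a TIME terminal.
GRAMMAR = {
    "REQUEST": [["DATE"], ["DATE", "TIME"]],
    "DATE": [["MONTH", "DAY"], ["MONTH", "DAY", "YEAR"]],
    "MONTH": [["december"]],
    "DAY": [["4"]],
    "YEAR": [["2000"]],
    "TIME": [["2000"], ["2230"]],
}


def parses(symbol, tokens):
    """Yield (tree, remaining_tokens) for each way symbol derives a prefix."""
    if symbol not in GRAMMAR:
        # Terminal: consume one token if it matches.
        if tokens and tokens[0] == symbol:
            yield symbol, tokens[1:]
        return
    for rhs in GRAMMAR[symbol]:
        # Each partial result is (children so far, tokens still unconsumed).
        partial = [([], tokens)]
        for child in rhs:
            partial = [
                (kids + [tree], rest)
                for kids, remaining in partial
                for tree, rest in parses(child, remaining)
            ]
        for kids, rest in partial:
            yield (symbol, kids), rest


def full_parses(tokens):
    """Return every parse tree that consumes all of the tokens."""
    return [tree for tree, rest in parses("REQUEST", tokens) if not rest]


for tree in full_parses("december 4 2000".split()):
    print(tree)
```

Under these assumed rules, “december 4 2000” yields one tree in which “2000” is a YEAR inside the DATE and another in which it is a separate TIME, whereas “december 4 2230” yields a single tree. Selecting between the two trees requires auxiliary information that the grammar itself does not supply.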
The second method used by NLU systems to extract information from text strings is a statistical approach where no grammar is used in analyzing the text string. Presently such systems rely on a large corpus of annotated sentences. These annotated sentences are collected into a training corpus. One can alternatively use a Monte Carlo method to generate training sentences. Using the Monte Carlo method, the NLU system can generate many possible permutations of ambiguities, or conflicts, in order to statistically train the NLU system. From the large amount of training data, the system can build a statistical model, for example a decision tree or maximum entropy model, to analyze text strings. The analysis performed using the decision tree is not based on a grammar, but rather on a model whose behavior depends on numerical parameters whose values are learned by examining the data in a training corpus. Decision trees use a series of successive, ordered queries to determine the meaning of a sentence. For example, the system can examine a text string on a word by word basis. At each word within the text string, the system can examine the words on either side of the current word to make a determination as to the meaning of the text string. Additional examples of queries can include “what is the word two words to the left of the current word?” or “what is the word two words to the right of the current word?” Thus, the system can learn that “2000” is probably part of a date when it follows “December 4”, but “2230” is probably a time. It learns this by examining the different dates and times observed in the training corpus.
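The successive, ordered queries described above can be sketched in Python as follows. The function names word_at and classify_number, the padding symbol, and the numeric thresholds are assumptions made for illustration. In particular, the branches below are written by hand as a stand-in for a decision tree whose queries and parameters would, in a statistical system, be learned from the training corpus.

```python
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def word_at(tokens, i, offset, pad="<none>"):
    """Answer the query 'what is the word `offset` positions from word i?'"""
    j = i + offset
    return tokens[j] if 0 <= j < len(tokens) else pad


def classify_number(tokens, i):
    """Label a four-digit token as 'year', 'time', or 'unknown'.

    Hand-written stand-in for a learned decision tree; the thresholds
    (1900 to 2100 for years) are illustrative assumptions only.
    """
    token = tokens[i]
    if not (token.isdigit() and len(token) == 4):
        return "unknown"
    value = int(token)
    # Query 1: is the word two words to the left a month name?
    # Query 2: is the word one word to the left a number (a day)?
    follows_month_day = (
        word_at(tokens, i, -2) in MONTHS and word_at(tokens, i, -1).isdigit()
    )
    # Query 3: does the token itself fall in a plausible year range?
    if follows_month_day and 1900 <= value <= 2100:
        return "year"
    # Query 4: is the token a valid military time (HHMM)?
    if value // 100 < 24 and value % 100 < 60:
        return "time"
    return "unknown"


print(classify_number("fly on december 4 2000".split(), 4))
print(classify_number("fly on december 4 2230".split(), 4))
```

With these assumed queries, the first call labels “2000” a year and the second labels “2230” a time, which corresponds to the behavior described for the trained model.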
Like the grammatical method, the statistical approach also has disadvantages. The first disadvantage is the time necessary to collect and annotate the training corpus. In particular, the training corpus will never contain a rich variety of dates and times. Hence one uses a Monte Carlo method to increase the amount of training data. Taking the previous example, the Monte Carlo method can generate the following text strings: “I want to fly on Tuesday, Dec. 5, 2000”, “I want to fly on Wednesday, Dec. 6, 2000”, and “I want to fly on Monday, Dec. 3, 2001”. The method can also generate different training sentences where only the date syntax differs within each sentence. The possible permutations, each of which is likely to occur, are virtually limitless. Using this method for the number of iterations necessary to train the NLU system to recognize different date syntaxes can become inefficient and time consuming. Also, the Monte Carlo method must generate examples of the different date styles in a consistent way. For example, the Monte Carlo procedure should not generate a malformed string such as “Dec., 4 2000” for a sentence containing a date. Thus, using a Monte Carlo procedure can be error prone.
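A Monte Carlo generator of the kind described above can be sketched in Python as follows. The sentence templates, the two date styles, the one-year sampling window, and all function names are hypothetical choices made for illustration. Each date style is produced by a single formatting function, which is one possible way to keep the generated syntax consistent and to avoid malformed output such as “Dec., 4 2000”.

```python
import random
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]
MONTH_ABBR = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June", "July",
              "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]
MONTH_FULL = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]
# Hypothetical sentence templates; {date} is replaced by a formatted date.
TEMPLATES = ["I want to fly on {date}", "can you give me a flight on {date}"]


def abbreviated(d):
    """Format a date in the style 'Monday, Dec. 4, 2000'."""
    return f"{WEEKDAYS[d.weekday()]}, {MONTH_ABBR[d.month - 1]} {d.day}, {d.year}"


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 4 -> '4th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def ordinal_style(d):
    """Format a date in the style '4th of December 2000'."""
    return f"{ordinal(d.day)} of {MONTH_FULL[d.month - 1]} {d.year}"


STYLES = [abbreviated, ordinal_style]


def generate(n, seed=0, start=date(2000, 12, 4)):
    """Return n (sentence, date) pairs sampled from a one-year window."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    samples = []
    for _ in range(n):
        d = start + timedelta(days=rng.randrange(365))
        style = rng.choice(STYLES)
        template = rng.choice(TEMPLATES)
        samples.append((template.format(date=style(d)), d))
    return samples


for sentence, _ in generate(3):
    print(sentence)
```

Because the weekday is computed from the sampled date rather than drawn independently, the weekday and the date in each generated sentence always agree. Returning the sampled date with each sentence also supplies the annotation that the training corpus requires.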