1. Technical Field
This invention relates to the field of natural language understanding (NLU), and more particularly, to including statistical NLU models within a statistical parser.
2. Description of the Related Art
NLU systems enable computers to understand and extract information from human written or spoken language. Such systems can function in a complementary manner with a variety of other computer applications where there exists a need to understand human language. NLU systems can extract relevant information contained within text and then supply this information to another application program or system for purposes such as booking flight reservations, finding documents, or summarizing text.
Currently within the art, either a grammatical approach or a statistical approach can be used for NLU. Within the statistical approach, three algorithms (statistical parsing, maximum entropy models, and source channel models) can be used for examining text in order to extract information. Statistical parsers can utilize phrases identified by one or more statistical phrase models as queries which then can be ordered as a decision tree. Maximum entropy models can use the preprocessed phrases as features which can be assigned weights. To function efficiently, NLU systems must first be trained to correctly parse future text inputs. The training process involves supplying the NLU system with a large quantity of annotated text, referred to as a training corpus. By examining the annotated training corpus, statistical models can be constructed which learn to parse future text.
Presently, such systems can require thousands of sentences of training data. One alternative is to use the Monte Carlo method of generating a large number of sentences. The Monte Carlo method can produce randomly generated text or sentences for use as a training corpus. Using the Monte Carlo method, the NLU system can be statistically trained by generating many possible permutations of conflict situations. A conflict situation is where a word can be construed by the NLU system as belonging to more than one phrase. From the large amount of training data, the NLU system can build a decision tree to analyze text strings and resolve conflict situations. Decision trees use a series of successive, ordered queries to determine the meaning of a sentence. For example, the NLU system can examine a text string on a word by word basis. At each word within the text string, the NLU system can determine the word on either side of the current word to make a determination as to the meaning of the text string. Additional examples of queries can include “what is the word two words to the left of the current word?” or “what is the word two words to the right of the current word?”
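By way of illustration only, the word-position queries described above can be sketched in Python as follows. The function names `word_at_offset` and `context_queries`, and the choice of a two-word window, are assumptions made for this sketch and do not describe any particular NLU system.

```python
from typing import Dict, List, Optional


def word_at_offset(tokens: List[str], index: int, offset: int) -> Optional[str]:
    """Answer a query such as 'what is the word `offset` words from the current word?'

    Returns None when the requested position falls outside the text string.
    """
    position = index + offset
    if 0 <= position < len(tokens):
        return tokens[position]
    return None


def context_queries(tokens: List[str], index: int) -> Dict[str, Optional[str]]:
    """Collect the answers to the left/right queries for the word at `index`.

    A decision tree could ask these queries in some learned order; here they
    are simply gathered into a dictionary (an assumption of this sketch).
    """
    return {
        "two_left": word_at_offset(tokens, index, -2),
        "one_left": word_at_offset(tokens, index, -1),
        "current": tokens[index],
        "one_right": word_at_offset(tokens, index, 1),
        "two_right": word_at_offset(tokens, index, 2),
    }


# Example: examine the word "fly" in a short text string.
tokens = "I want to fly on Monday".split()
answers = context_queries(tokens, tokens.index("fly"))
```

In this sketch, a query that reaches past either end of the text string yields `None`, which a decision tree could treat as its own answer.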
Unfortunately, the above described statistical approach can have disadvantages. One disadvantage is that the Monte Carlo method itself takes time to generate a training corpus from which the NLU system can be trained. Moreover, the Monte Carlo method necessarily generates a large amount of training data. For example, from the text input “I want to fly on Monday, Dec. 4, 2000”, the Monte Carlo method can generate the following text strings: “I want to fly on Tuesday, Dec. 5, 2000”, “I want to fly on Wednesday, Dec. 6, 2000”, and “I want to fly on Monday, Dec. 3, 2001”. The Monte Carlo method also can generate different training sentences where only the date syntax differs within each sentence. The possible permutations, each of which is equally likely to occur, can be virtually limitless. Continuing this method for the number of iterations necessary to train the NLU system to recognize different, but equally likely, dates and syntaxes can become inefficient and time consuming. Another disadvantage to present statistical approaches is that once the training data has been generated, that training data must also be annotated for grammatical phrases and words. Finally, each time an NLU application is built, enough training data must be collected and annotated so that the model parameters can be trained. Thus, training each NLU application can be inefficient and time consuming, as well as redundant.
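By way of illustration only, the following Python sketch shows how randomly sampled dates can multiply a single sentence template into many training sentences. The template, the two-year date range, the fixed seed, and the names `format_date` and `generate_sentences` are assumptions made for this sketch; it is not a description of any particular Monte Carlo generator.

```python
import random
from datetime import date, timedelta
from typing import List

# Explicit name tables avoid locale-dependent formatting.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "Jun.",
          "Jul.", "Aug.", "Sep.", "Oct.", "Nov.", "Dec."]


def format_date(d: date) -> str:
    """Render a date in the 'Monday, Dec. 4, 2000' syntax used in the example."""
    return f"{WEEKDAYS[d.weekday()]}, {MONTHS[d.month - 1]} {d.day}, {d.year}"


def generate_sentences(count: int, seed: int = 0,
                       start: date = date(2000, 1, 1),
                       span_days: int = 730) -> List[str]:
    """Generate `count` sentences, each with a date drawn uniformly at random
    from the `span_days` days beginning at `start` (hypothetical parameters)."""
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    sentences = []
    for _ in range(count):
        d = start + timedelta(days=rng.randrange(span_days))
        sentences.append(f"I want to fly on {format_date(d)}")
    return sentences


corpus = generate_sentences(5)
```

Even with this single template and a single date syntax, a two-year range admits 730 distinct sentences, and each additional date syntax multiplies that count again.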
Grammatical approaches to NLU can have disadvantages as well. The grammatical approach to NLU incorporates grammars for recognizing text strings. Notably, the grammars used within NLU systems tend to be application specific, and thus, difficult to reuse across multiple applications. Another disadvantage to grammars can be the need for linguistic experts to develop suitable grammars for an NLU application. Use of linguistic experts can significantly impact NLU application development due to the extra developmental step and the added cost.