The present invention relates to language modeling. More particularly, the present invention relates to a language processing system utilizing a unified language model.
Accurate speech recognition requires more than just an acoustic model to select the correct word spoken by the user. In other words, if a speech recognizer must determine which word has been spoken and all words have the same likelihood of being spoken, the speech recognizer will typically perform unsatisfactorily. A language model provides a method or means of specifying which sequences of words in the vocabulary are possible, or in general provides information about the likelihood of various word sequences.
One form of a language model that has been used is a unified language model. The unified language model is actually a combination of an N-gram language model (hybrid N-gram language model) and a plurality of context-free grammars. In particular, the plurality of context-free grammars is used to define semantic or syntactic concepts of sentence structure or spoken language using non-terminal tokens to represent the semantic or syntactic concepts. Each non-terminal token is defined using at least terminals and, in some instances, other non-terminal tokens in a hierarchical structure. The hybrid N-gram language model includes at least some of the same non-terminals of the plurality of context-free grammars embedded therein such that, in addition to predicting terminals or words, the N-gram language model also can predict non-terminals.
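By way of illustration only, the following minimal sketch shows how a score might be computed under a model of this general kind. The grammar, the bigram table, the probability values, and the names CFG, BIGRAM and unified_log_prob are all hypothetical and are not taken from any particular implementation. The sketch also assumes that the segmentation of the word sequence into words and non-terminal spans is already known, and that each non-terminal expands directly into words rather than into other non-terminals.

```python
import math

# Hypothetical toy grammars: each non-terminal maps to its possible
# word-sequence expansions and an assumed probability for each expansion.
CFG = {
    "<DAY>": {("monday",): 0.5, ("tuesday",): 0.5},
    "<TIME>": {("two", "pm"): 0.5, ("noon",): 0.5},
}

# Hypothetical bigram probabilities over a mixed vocabulary of words
# (terminals) and non-terminals, with <s> and </s> as sentence boundaries.
BIGRAM = {
    ("<s>", "meet"): 1.0,
    ("meet", "on"): 0.5,
    ("meet", "at"): 0.5,
    ("on", "<DAY>"): 1.0,
    ("<DAY>", "at"): 0.5,
    ("<DAY>", "</s>"): 0.5,
    ("at", "<TIME>"): 1.0,
    ("<TIME>", "</s>"): 1.0,
}


def unified_log_prob(segments):
    """Return the log probability of one segmented word sequence.

    segments is a list of (token, words) pairs. token is either a plain
    word or a non-terminal; words is the tuple of words that the token
    covers. A KeyError is raised for events missing from the toy tables.
    """
    log_prob = 0.0
    previous = "<s>"
    for token, words in segments:
        # The N-gram predicts the next token, word or non-terminal alike.
        log_prob += math.log(BIGRAM[(previous, token)])
        if token in CFG:
            # The grammar scores the words generated inside the non-terminal.
            log_prob += math.log(CFG[token][words])
        previous = token
    log_prob += math.log(BIGRAM[(previous, "</s>")])
    return log_prob


# "meet on monday at noon", with the date and time covered by non-terminals.
example = [
    ("meet", ("meet",)),
    ("on", ("on",)),
    ("<DAY>", ("monday",)),
    ("at", ("at",)),
    ("<TIME>", ("noon",)),
]
example_prob = math.exp(unified_log_prob(example))
```

In this sketch the N-gram component never sees the individual words "monday" or "noon"; it predicts only that a day or a time follows, and the corresponding grammar accounts for the particular words used.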
The current implementation of the unified language model in a speech recognition system uses a conventional terminal-based N-gram model to generate hypotheses for the utterance to be recognized. As is well known, during the speech recognition process, the speech recognition system will explore various hypotheses of shorter sequences of possible words and, based on probabilities obtained from the conventional terminal-based N-gram model, discard those yielding lower probabilities. Longer hypotheses are formed for the utterance, and initial language model scores are calculated using the conventional terminal-based N-gram model.
Commonly, the language model scores are combined with the acoustic model score to provide a total score for each hypothesis. The hypotheses are then ranked from highest to lowest based on their total scores. The unified language model is then applied to each of the hypotheses, or a subset thereof, to calculate new language model scores, which are then combined with the acoustic model score to provide new total scores. The hypotheses are then re-ranked based on the new total scores, wherein the highest is considered to correspond to the utterance. However, because some hypotheses were discarded during the search process, the correct hypothesis may already have been discarded before the language model scores are recalculated with the unified language model, and therefore will not make it into the list of hypotheses. Using a unified language model, which has the potential to be more accurate than the conventional word-based N-gram model, directly during the search process can help prevent such errors.
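By way of illustration only, the two-pass procedure described above and the error it can produce may be sketched as follows. The hypotheses, all score values, and the names Hypothesis and two_pass are hypothetical. For brevity, the sketch prunes complete hypotheses to a fixed number of survivors, whereas an actual recognizer discards partial hypotheses during the search.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: str
    acoustic: float  # log-domain acoustic model score (invented value)
    first_lm: float  # log-domain score from the word N-gram (invented value)


def two_pass(hypotheses, second_lm, keep):
    """Prune with the first-pass scores, then re-rank with second_lm.

    second_lm maps a word string to the log-domain score assigned by the
    second-pass language model. Only the `keep` best first-pass hypotheses
    are ever re-scored.
    """
    # Pass 1: total score = acoustic score + first-pass language model score.
    ranked = sorted(hypotheses, key=lambda h: h.acoustic + h.first_lm, reverse=True)
    survivors = ranked[:keep]
    # Pass 2: replace the language model score and re-rank the survivors.
    return sorted(survivors, key=lambda h: h.acoustic + second_lm[h.words], reverse=True)


hypotheses = [
    Hypothesis("meet at two pm", acoustic=-10.0, first_lm=-9.0),
    Hypothesis("meat at two pm", acoustic=-9.5, first_lm=-6.0),
    Hypothesis("meet at to pm", acoustic=-9.8, first_lm=-6.5),
]

# Invented second-pass scores that strongly favor the first hypothesis.
second_lm = {
    "meet at two pm": -3.0,
    "meat at two pm": -8.0,
    "meet at to pm": -9.0,
}

best_when_pruned = two_pass(hypotheses, second_lm, keep=2)[0].words
best_when_all_kept = two_pass(hypotheses, second_lm, keep=3)[0].words
```

With these invented values, the hypothesis preferred by the second-pass model has the lowest first-pass total score. When only two hypotheses are kept it is discarded before re-scoring and cannot be recovered; when all three are kept it ranks first.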
Although speech recognition systems have been used in the past simply to provide textual output corresponding to a spoken utterance, there is a desire to use spoken commands to perform various actions with a computer. Typically, the textual output from the speech recognition system is provided to a natural language parser, which attempts to ascertain the meaning or intent of the utterance in order to perform a particular action. This structure therefore requires creation and fine-tuning of the speech recognition system as well as creation and fine-tuning of the natural language parser, both of which can be tedious and time-consuming.
There is thus a continuing need for a language processing system that addresses one or both of the problems discussed above.