1. Technical Field
The present invention relates generally to speech recognition and, in particular, to a method and system for combining language model scores generated by a language model mixture in an Automatic Speech Recognition system.
2. Description of Related Art
In general, an Automatic Speech Recognition (ASR) system includes a vocabulary, an acoustic model, and a language model (LM). The vocabulary is a table of words, with each word represented as a sequence of phones which are combined to form the pronunciation of the word. The acoustic model constructs a list of candidate words given the acoustic data. The language model predicts the current word using its word context.
The language model generally includes a collection of conditional probabilities corresponding to the combining of words in the vocabulary. The task of the language model is to express the restrictions imposed on the way in which words can be combined to form sentences. Some of the most popular language models are n-gram models, which make the assumption that the a priori probability of a word sequence can be decomposed into conditional probabilities of each word given the n-1 words preceding it. In the context of n-gram language models, a trigram is a string of three consecutive words (denoted by w1 w2 w3). Similarly, a bigram is a string of two consecutive words, and a unigram is a single word. The conditional probability of the trigram model may be expressed as follows: Prob(w3|w1w2).
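By way of illustration, the decomposition assumed by a trigram model may be written as follows for a sequence of N words. The symbol N and the index i are introduced here for illustration only; the first two factors are conditioned on shorter histories because fewer than two preceding words are available.

```latex
P(w_1 w_2 \ldots w_N) \approx P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{N} P(w_i \mid w_{i-2}\, w_{i-1})
```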
Generally, a trigram language model is trained using a transcription consisting of a large text corpus. The corpus consists of sentences, which nominally correspond to individual utterances that a speaker might produce in the context of a particular task. The training involves inputting the sentences and determining statistics for each word model in a manner which enhances the probability of the correct word relative to the probabilities associated with other words. As is known, such training provides counts for all trigrams, bigrams and unigrams identified in the corpus. The count of a given n-gram is the number of occurrences of the given n-gram in the corpus (word frequency).
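The counting step described above may be sketched as follows. This is a minimal illustration only, assuming whitespace-tokenized sentences and no sentence-boundary markers; the function name `count_ngrams` and the toy corpus are hypothetical and do not appear in any particular ASR system.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3):
    """Return {n: Counter of n-word tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            # Slide a window of n words across the sentence.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


# Hypothetical three-sentence corpus.
corpus = ["by the way", "by the sea", "the way home"]
counts = count_ngrams(corpus)
```

Here `counts[3]` holds the trigram counts, `counts[2]` the bigram counts, and `counts[1]` the unigram counts; a real system would typically also normalize or smooth these counts into probabilities.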
The training of the language model results in determining the likelihood of each ordered triplet of words, ordered pair of words, or single word in the vocabulary. From these likelihoods, a list of the most likely triplets of words and a list of the most likely pairs of words are formed. Additionally, the likelihood of a triplet not being in the triplet list and the likelihood of a pair not being in the pair list are determined.
The probability assigned by the language model to a subject word will now be described. When a subject word follows two words, a determination is made as to whether the subject word and the two preceding words are on the most likely triplet list described above with reference to the training of the language model. If so, the stored probability assigned to the triplet is indicated. If the subject word and its two predecessors are not on the triplet list, a determination is made as to whether the subject word and its adjacent predecessor are on the most likely pairs of words list described above. If so, the probability of the pair is multiplied by the probability of a triplet not being on the triplet list, and the product is assigned to the subject word. If the subject word and its predecessor(s) are on neither the triplet list nor the pair list, the probability of the subject word alone is multiplied by the likelihood of a triplet not being on the most likely triplet list and by the probability of a pair not being on the most likely pair list. The product is assigned to the subject word.
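The three-way lookup described above may be sketched as follows. This is an illustrative sketch only: the function name, the dictionary-based lists, and all numeric values are hypothetical, and the subject word is assumed to be in the vocabulary.

```python
def backoff_probability(w1, w2, w3, trigrams, bigrams, unigrams,
                        p_triplet_missing, p_pair_missing):
    """Probability assigned to subject word w3 following words w1 w2.

    trigrams / bigrams: the "most likely" triplet and pair lists, stored
    as dicts mapping word tuples to probabilities (assumed layout).
    p_triplet_missing / p_pair_missing: likelihood of a triplet (pair)
    not being on the corresponding list.
    """
    if (w1, w2, w3) in trigrams:
        # Triplet is on the list: use its stored probability.
        return trigrams[(w1, w2, w3)]
    if (w2, w3) in bigrams:
        # Back off to the pair, scaled by the triplet-missing likelihood.
        return bigrams[(w2, w3)] * p_triplet_missing
    # Back off to the single word, scaled by both likelihoods.
    return unigrams[w3] * p_triplet_missing * p_pair_missing


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
trigrams = {("by", "the", "way"): 0.5}
bigrams = {("the", "sea"): 0.25}
unigrams = {"way": 0.125, "sea": 0.125, "home": 0.0625}

p_way = backoff_probability("by", "the", "way", trigrams, bigrams,
                            unigrams, 0.5, 0.25)
p_sea = backoff_probability("near", "the", "sea", trigrams, bigrams,
                            unigrams, 0.5, 0.25)
p_home = backoff_probability("by", "the", "home", trigrams, bigrams,
                             unigrams, 0.5, 0.25)
```

With these toy values, the first call uses the stored triplet probability, the second backs off to the pair, and the third backs off to the single word.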
Thus, the language model is used to enhance the probability of a correct word selection during the decoding process. This is because, while most of the candidate words selected by the fast match module (described below) will be acoustically similar to the correct word, some of the candidate words may be removed from further consideration based on linguistics. For example, in the context of the two preceding words "by the" (w1 w2), the acoustic fast match list for the correct word "way" (w3) might include linguistically unlikely words such as "say", "ray", and "may".
FIG. 1 is a block diagram illustrating an Automatic Speech Recognition (ASR) system 100 having a language model mixture according to the prior art. The ASR system 100 includes: an acoustic front-end 110; a Fast Match (FM) 112; a set of language models 114; a first combining module 116; and a second combining module 118.
Acoustic data, produced by the acoustic front-end 110, is processed by the Fast Match module 112 to construct a list of probable words for the current position in a word sequence. Previously recognized words are used by the set of language models 114 to predict the current word. Each of the language models assigns a score to each of the words predicted by the Fast Match module 112. The scores produced by the individual language models are combined by the first combining module 116 to produce a single language model score for each word predicted by the Fast Match module 112. The language model score and fast match score for each word are then combined by the second combining module 118.
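The data flow through the two combining modules may be sketched as follows. The description above does not state how either module computes its result, so this sketch makes two assumptions: the first combining module forms a linear mixture of the language model probabilities with fixed weights, and the second adds the logarithm of that mixture to a fast match log score. All function names, weights, and scores are hypothetical, and every language model is assumed to return a probability greater than zero.

```python
import math


def combine_lm_scores(lm_probs, weights):
    """First combining step (assumed): fixed-weight linear mixture."""
    if len(lm_probs) != len(weights):
        raise ValueError("one weight is required per language model")
    return sum(w * p for w, p in zip(weights, lm_probs))


def combine_with_fast_match(lm_prob, fm_log_score, lm_weight=1.0):
    """Second combining step (assumed): sum of scores in the log domain."""
    return fm_log_score + lm_weight * math.log(lm_prob)


def rank_candidates(candidates, language_models, weights, history):
    """Rank fast match candidates by combined score, best first.

    candidates: {word: fast match log score}
    language_models: callables taking (history, word), returning a probability
    """
    scored = {}
    for word, fm_log_score in candidates.items():
        lm_probs = [lm(history, word) for lm in language_models]
        lm_prob = combine_lm_scores(lm_probs, weights)
        scored[word] = combine_with_fast_match(lm_prob, fm_log_score)
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical fast match list for the word following "by the".
candidates = {"way": -2.0, "say": -1.5, "ray": -2.5, "may": -1.8}

# Two toy language models backed by lookup tables (history is ignored).
table_a = {"way": 0.5, "say": 0.125, "ray": 0.0625, "may": 0.125}
table_b = {"way": 0.25, "say": 0.125, "ray": 0.0625, "may": 0.25}
language_models = [lambda h, w: table_a[w], lambda h, w: table_b[w]]

ranking = rank_candidates(candidates, language_models, [0.5, 0.5],
                          ("by", "the"))
```

In this toy setting, "say" has the best fast match score, but the combined score ranks "way" first because both language models favor it in the given context.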
The interpolated trigram models used in IBM's VIA VOICE language models are a weighted mixture of a raw trigram, bigram, unigram and uniform probability model. The weights are dependent on “buckets” that depend on the immediate history w1w2 of a word w3 in a word triplet w1w2w3. The weights are expected to change for the different “buckets” so as to make the trigram model more important for word pairs w1w2 that were frequently seen and less important when w1w2 were less frequently seen in the training corpus. Similarly, it is known that weighted mixtures of language models can be formed and the weights estimated by the Baum-Welch algorithm. However, to the extent that language model mixtures have been used in the prior art, such use has been limited to dynamically mixing the layers (e.g., trigram, bigram, unigram layers) of a single language model and then combining the scores.
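A bucket-dependent interpolation of this kind may be sketched as follows. This is not the VIA VOICE implementation: the bucket thresholds, the weight table, and the function names are hypothetical, and the weights are hand-picked here rather than estimated by the Baum-Welch algorithm.

```python
def bucket_for_history(history_count, thresholds=(0, 10, 100)):
    """Map the training count of the history w1 w2 to a bucket index.

    A more frequently seen history falls in a higher bucket.
    The thresholds are assumptions for illustration.
    """
    bucket = 0
    for threshold in thresholds:
        if history_count > threshold:
            bucket += 1
    return bucket


# Hypothetical weights per bucket, in the order
# (trigram, bigram, unigram, uniform). Each row sums to 1.
BUCKET_WEIGHTS = {
    0: (0.0, 0.25, 0.5, 0.25),        # history never seen
    1: (0.25, 0.25, 0.25, 0.25),      # history rarely seen
    2: (0.5, 0.25, 0.125, 0.125),     # history moderately frequent
    3: (0.75, 0.125, 0.0625, 0.0625), # history frequently seen
}


def interpolated_probability(p_tri, p_bi, p_uni, vocab_size, history_count):
    """Weighted mixture of raw trigram, bigram, unigram and uniform models."""
    weights = BUCKET_WEIGHTS[bucket_for_history(history_count)]
    components = (p_tri, p_bi, p_uni, 1.0 / vocab_size)
    return sum(w * p for w, p in zip(weights, components))


# Same raw probabilities, different history counts (toy values).
p_frequent = interpolated_probability(0.5, 0.25, 0.125, 8, history_count=500)
p_unseen = interpolated_probability(0.5, 0.25, 0.125, 8, history_count=0)
```

With identical raw probabilities, the frequently seen history places most of the weight on the trigram estimate, whereas the unseen history relies on the lower-order and uniform components.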
Although ASR systems with language model mixtures generally have a lower word error rate (WER) than ASR systems with a single language model, it is nonetheless desirable that the WER of the former systems be even lower. Accordingly, there is a need for a method and/or system for decoding a speech utterance by an ASR system with a language model mixture that has a reduced WER with respect to the prior art.