The invention generally relates to automatic speech recognition, and more particularly, to a technique for adjusting the mixture components of hidden Markov models as used in automatic speech recognition.
The goal of automatic speech recognition (ASR) systems is to determine the lexical identity of spoken utterances. The recognition process, also referred to as classification, begins with the conversion of the acoustical signal into a stream of spectral vectors or frames that describe the important characteristics of the signal at specified times. Classification is attempted by first creating reference models that describe some aspect of the behavior of spectral frames corresponding to different words.
A wide variety of models have been developed, but they all share the property that they describe the temporal characteristics of spectra typical to particular words or sub-word segments. The sequence of spectra arising from an input utterance is compared to such models, and the success with which different models predict the behavior of the input frames determines the putative identity of the utterance.
Currently most systems utilize some variant of a statistical model called the hidden Markov model (HMM). Such models consist of sequences of states connected by arcs, and a probability density function (pdf) associated with each state which describes the likelihood of observing any given spectral vector at that state. A separate set of probabilities determines transitions between states.
Various levels of modeling power are available in the case of the probability densities describing the observed spectra associated with the states of the HMM. There are two major approaches: the discrete pdf and the continuous pdf. With continuous pdfs, parametric functions specify the probability of any arbitrary input spectral vector given a state. The most common class of functions used for this purpose is a mixture of Gaussians, where arbitrary pdfs are modeled by a weighted sum of normal distributions. One drawback of using continuous pdfs is that the designer must make explicit assumptions about the nature of the pdf being modeled, something which can be quite difficult since the true distribution form for the speech signal is not known. In addition, continuous pdf models are computationally far more expensive than discrete pdf models.
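The mixture-of-Gaussians density described above can be illustrated with a short sketch. It assumes diagonal covariances and works in the log domain for numerical stability; both choices, and the function and parameter names, are illustrative assumptions and are not taken from this specification.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities for one state.

    Each mixture component ("branch") contributes weight * N(x; mean, var).
    The sum is formed with the log-sum-exp trick to avoid underflow.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component state scored against a 2-dimensional frame.
score = log_mixture_likelihood(
    [0.2, -0.1],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```

The cost of evaluating every component for every state and frame is one reason continuous pdf models are more expensive than discrete ones.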
The total number of pdfs in a recognition system depends on the number of distinct HMM states, which in turn is determined by the type of models used, e.g., phonetic or word models. In many systems the states from different models can be pooled, i.e., the states from different models can share pdfs from a common set or pool. For example, some states from two different models that represent a given phone in different phonetic contexts (i.e., an allophone) may have similar pdfs. In some systems these pdfs will be combined into one, to be shared by both states. This may be done to save memory and in some instances to overcome a problem known as undertraining.
The model pdfs, whether discrete or continuous, are most commonly trained using the maximum likelihood method. In this manner, the model parameters are adjusted so that the likelihood of observing the training data given the model is maximized. However, it is known that this approach does not necessarily lead to the best recognition performance. This realization has led to the development of new training criteria, known as discriminative, the objective of which is to adjust model parameters so as to minimize the number of recognition errors rather than fit the distributions to the data.
FIG. 1 shows a feature vector 10 representative of an input speech frame in a multidimensional vector space, a "correct" state SC 11 from the model that corresponds to the input speech, and an "incorrect" state SI 12 from a model that does not correspond to the input speech. As shown in FIG. 1, the vector space distance from the feature vector 10 to the best branch 13 (the closest mixture component) of correct state SC 11 is very nearly the same as the vector space distance from the feature vector 10 to the best branch 14 of the incorrect state SI 12. In this situation, there is very little basis at the state level for distinguishing the correct state SC 11 from the incorrect state SI 12.
Discriminative training attempts to adjust the best branch 13 of correct state SC 11 a little closer to the vector space location of feature vector 10, and adjust the best branch 14 of the incorrect state SI 12 a little farther from the vector space location of feature vector 10. Thus, a future feature vector near the vector space location of feature vector 10 will be more likely to be identified with correct state SC 11 than with incorrect state SI 12. Of course, discriminative training may adjust the vector space of the correct state with respect to multiple incorrect states. Similarly, rather than adjusting the best branches of the states, a set of mixture components within each state may be adjusted.
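The best-branch adjustment described above may be sketched as follows. The sketch assumes that each state is a list of mixture component mean vectors, that the best branch is the mean closest in squared Euclidean distance, and that a fixed step size is used. The update rule and all names are hypothetical illustrations and are not the specific adjustment of any embodiment.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_branch(frame, branches):
    """Index of the mixture component mean closest to the frame."""
    return min(range(len(branches)),
               key=lambda i: squared_distance(frame, branches[i]))

def discriminative_step(frame, correct_state, incorrect_state, step=0.1):
    """Move the correct best branch toward the frame and the incorrect
    best branch away from it. Both states are modified in place."""
    c = best_branch(frame, correct_state)
    i = best_branch(frame, incorrect_state)
    correct_state[c] = [m + step * (x - m)
                        for x, m in zip(frame, correct_state[c])]
    incorrect_state[i] = [m - step * (x - m)
                          for x, m in zip(frame, incorrect_state[i])]
    return c, i

# Hypothetical states with two branches each; the frame is nearly
# equidistant from the best branch of each state, as in FIG. 1.
frame = [0.0, 0.0]
state_sc = [[1.0, 0.0], [5.0, 5.0]]
state_si = [[-1.0, 0.0], [4.0, 4.0]]
discriminative_step(frame, state_sc, state_si)
```

After the step, the frame is closer to the best branch of the correct state than to the best branch of the incorrect state, which is the separation the paragraph above describes.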
While discriminative training shows considerable promise, so far it has been applied most successfully to small vocabulary and isolated word recognition tasks. In addition, discriminative training presents a number of new problems, such as how to appropriately smooth the discriminatively trained pdfs, and how to adapt these systems to a new user with a relatively small amount of training data.
U.S. Pat. No. 6,260,013 describes a system using discriminatively trained multi-resolution models in the context of an isolated word recognition system. However, the techniques described therein are not efficiently extensible to a continuous speech recognition system.
A representative embodiment of the present invention includes a method of a continuous speech recognition system for discriminatively training hidden Markov models for a system recognition vocabulary. An input word phrase is converted into a sequence of representative frames. A correct state sequence alignment with the sequence of representative frames is determined, the correct state sequence alignment corresponding to models of words in the input word phrase. A plurality of incorrect recognition hypotheses is determined representing words in the recognition vocabulary that do not correspond to the input word phrase, each hypothesis being a state sequence based on the word models in the acoustic model database. A correct segment of the correct word model state sequence alignment is selected for discriminative training. A frame segment of frames in the sequence of representative frames is determined that corresponds to the correct segment. An incorrect segment of a state sequence in an incorrect recognition hypothesis is selected, the incorrect segment corresponding to the frame segment. A discriminative adjustment is performed on selected states in the correct segment and the corresponding states in the incorrect segment.
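The correspondence between a correct segment, its frame segment, and an incorrect segment can be illustrated with a minimal sketch. It assumes that an alignment is stored as one entry per frame, that the correct alignment records a (word position, state identifier) pair for each frame, and that a segment is chosen as the frames of one word. These data layouts and names are hypothetical and are used only for illustration.

```python
def frame_span(alignment, word_position):
    """Return (start, end) frame indices, end exclusive, of the frames
    aligned to the word at word_position in the correct alignment."""
    frames = [t for t, (w, _state) in enumerate(alignment)
              if w == word_position]
    if not frames:
        raise ValueError("word position not present in alignment")
    return frames[0], frames[-1] + 1

def corresponding_segment(hypothesis_states, start, end):
    """States of an incorrect hypothesis that cover the same frames."""
    return hypothesis_states[start:end]

# Hypothetical six-frame utterance: three words in the correct alignment.
correct = [(0, "a1"), (0, "a2"), (1, "b1"), (1, "b1"), (1, "b2"), (2, "c1")]
# One incorrect hypothesis, given as one state identifier per frame.
incorrect = ["x1", "x1", "y1", "y2", "y2", "z1"]

start, end = frame_span(correct, 1)
correct_segment = [state for _w, state in correct[start:end]]
incorrect_segment = corresponding_segment(incorrect, start, end)
```

Pairing the two segments frame by frame yields the selected states in the correct segment and the corresponding states in the incorrect segment on which an adjustment could be performed.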
In a further embodiment, performing a discriminative adjustment occurs in a batch training mode at the end of a user session with the speech recognition system, and the discriminative adjustment performed on the selected and corresponding states represents a sum of calculated adjustments over the session. Alternatively, performing a discriminative adjustment may occur in an on-line mode in which the selected and corresponding states are discriminatively adjusted for each input word phrase.
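The difference between the batch mode and the on-line mode may be sketched as follows. The sketch assumes that each adjustable parameter is a scalar addressed by a key and that adjustments are additive; the class and its methods are hypothetical and are not part of this specification.

```python
from collections import defaultdict

class AdjustmentAccumulator:
    """Sums calculated adjustments per parameter until they are applied."""

    def __init__(self):
        self.pending = defaultdict(float)

    def add(self, key, delta):
        """Record an adjustment without changing the model yet."""
        self.pending[key] += delta

    def apply(self, params):
        """Apply the summed adjustments to the model, then reset."""
        for key, delta in self.pending.items():
            params[key] += delta
        self.pending.clear()

# Hypothetical model with one scalar parameter per (state, branch) key.
params = {("SC", 0): 1.0, ("SI", 0): -1.0}
session = AdjustmentAccumulator()

# Batch mode: accumulate over every input word phrase in the session...
session.add(("SC", 0), 0.2)
session.add(("SC", 0), -0.05)
session.add(("SI", 0), -0.1)
# ...and apply the sum once, at the end of the user session.
session.apply(params)
```

In an on-line mode under the same assumptions, `apply` would instead be called after each input word phrase, so that each phrase is recognized with the already adjusted states.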
Performing a discriminative adjustment may include using a language model weighting of the selected and corresponding states, in which case, when the selected segment of an incorrect recognition hypothesis is a fractional portion of a word model state sequence, the language model weighting for the fractional portion corresponds to the fractional amount of the word model that the fractional portion represents. The discriminative adjustment may include performing a gradient adjustment to selected branches of a selected state in the correct hypothesis model and a corresponding state in the incorrect hypothesis. The gradient adjustment may be to the best branch in each state model.
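The fractional language model weighting described above can be illustrated with a short sketch. It assumes that the fractional amount is measured as the number of word model states inside the selected segment divided by the total number of states in that word model, and that the weighting scales linearly. Both the measure and the function name are assumptions made for illustration.

```python
def fractional_lm_weight(word_lm_weight, states_in_segment, states_in_word_model):
    """Scale a word's language model weighting by the fraction of the
    word model state sequence that the selected segment covers."""
    if not 0 < states_in_segment <= states_in_word_model:
        raise ValueError("segment must cover between one and all states")
    return word_lm_weight * states_in_segment / states_in_word_model

# Hypothetical case: the segment covers 2 of the 6 states of a word model
# whose full language model weighting is -6.0 (a log-domain score).
partial_weight = fractional_lm_weight(-6.0, 2, 6)
```

A segment that covers the whole word model receives the full weighting under this assumption, so whole-word and partial-word segments are handled by the same function.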