Most speech recognition systems use a statistical model called the hidden Markov model (HMM). Such a model consists of a sequence of states connected by arcs, with a probability density function (pdf) associated with each state that describes the likelihood of observing any given feature vector at that state. A separate set of probabilities governs the transitions between the states. Most large vocabulary continuous speech recognition systems use continuous pdfs, which are parametric functions that describe the probability of any arbitrary input feature vector given a model state.
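By way of illustration only, the structure described above (states, arcs with transition probabilities, and one pdf per state) can be sketched in a few lines of Python. The function names, the use of a single diagonal-covariance Gaussian per state, and the forward recursion used to score a sequence are assumptions made for this sketch; they are not taken from any particular recognition system.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def forward_log_likelihood(observations, log_init, log_trans, state_pdfs):
    """Log-likelihood of a feature-vector sequence under an HMM.

    log_init[i]     : log probability of starting in state i
    log_trans[i][j] : log probability of the arc from state i to state j
    state_pdfs[j]   : (mean, var) of the pdf attached to state j (assumed Gaussian)
    """
    n = len(log_init)
    # Initialise with the first observation scored by each state's pdf.
    alpha = [
        log_init[j] + log_gaussian(observations[0], *state_pdfs[j])
        for j in range(n)
    ]
    # Propagate along the arcs, then score each new observation.
    for x in observations[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_gaussian(x, *state_pdfs[j])
            for j in range(n)
        ]
    return logsumexp(alpha)

# Hypothetical two-state left-to-right model over one-dimensional features.
NEG_INF = -math.inf
example_pdfs = [([0.0], [1.0]), ([5.0], [1.0])]
example_init = [0.0, NEG_INF]                      # always start in state 0
example_trans = [[math.log(0.6), math.log(0.4)],   # stay, or advance to state 1
                 [NEG_INF, 0.0]]                   # state 1 only loops on itself
```

A sequence whose features move from near 0 to near 5 follows the left-to-right arcs and therefore scores higher under this model than the same features in reverse order.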
One drawback of using continuous pdfs is that the designer must make explicit assumptions about the form of the pdfs being modeled, which can be quite difficult because the true distribution of the speech signal is not known. The most common class of functions used for this purpose is a mixture of Gaussians, in which an arbitrary pdf is modeled by a weighted sum of normal distributions.
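As a minimal sketch of such a mixture, the following Python evaluates a weighted sum of normal densities at a feature vector. Diagonal covariances, weights that sum to one, and the function names are assumptions of this example rather than requirements of any specific system.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance normal distribution at vector x."""
    density = 1.0
    for xi, m, v in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return density

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of normal densities; weights are assumed to sum to 1."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )

# Hypothetical two-component mixture over one-dimensional features.
example_weights = [0.5, 0.5]
example_means = [[-2.0], [2.0]]
example_variances = [[1.0], [1.0]]
density_at_zero = mixture_pdf([0.0], example_weights, example_means, example_variances)
```

Adding components lets the mixture approximate multimodal or skewed densities that a single normal distribution cannot represent, at the cost of more parameters to estimate.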
The model pdfs are most commonly trained using the maximum likelihood method. In this method, the model parameters are adjusted so that the likelihood of observing the training data given the model is maximized. However, it is known that this approach does not necessarily lead to the best recognition performance. This problem can be addressed by discriminative training of the mixture models. The idea is to adjust the model parameters so as to minimize the number of recognition errors rather than to fit the distributions to the data. One approach to discriminative training in a large vocabulary continuous speech recognition system is described in U.S. Pat. No. 6,490,555, the contents of which are incorporated herein by reference.
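The maximum likelihood criterion can be illustrated for the simplest case, a single one-dimensional Gaussian, where the maximizing parameters have a closed form (the sample mean and the biased sample variance). The data values and function names below are made up for the illustration. In an HMM the state sequence is hidden, so the estimates are typically obtained iteratively (for example with the Baum-Welch algorithm) rather than in closed form; this sketch shows only the criterion being maximized and does not illustrate discriminative training.

```python
import math

def log_likelihood(data, mean, var):
    """Log-likelihood of scalar training data under one Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in data
    )

def ml_estimate(data):
    """Closed-form ML estimate: sample mean and biased sample variance."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, var

# Made-up scalar training features, for illustration only.
frames = [1.2, 0.8, 1.1, 0.9, 1.5, 0.7]
mean_ml, var_ml = ml_estimate(frames)
best = log_likelihood(frames, mean_ml, var_ml)
```

Any other choice of mean or variance gives a log-likelihood no higher than `best` on these data. That is the sense in which the parameters fit the training distribution, and it is a different objective from minimizing the number of recognition errors.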