The present invention relates generally to speech recognition systems and, more particularly, to methods and apparatus for performing enhanced likelihood computation using regression in speech recognition systems.
It is known that a continuous speech recognition system, such as the IBM continuous speech recognition system, uses a set of phonetic baseforms and context-dependent models. These models are built by constructing decision tree networks that query the phonetic context to arrive at the appropriate models for the given context. A decision tree is constructed for every arc (a sub-phonetic unit that corresponds to a state of the three-state Hidden Markov Model, or HMM). Each terminal node (leaf) of the tree represents a set of phonetic contexts, such that the feature vectors observed in these contexts were close together as defined by how well they fit a diagonal gaussian model. The feature vectors at each terminal node are modeled using a gaussian mixture density with each gaussian having a diagonal covariance matrix. The IBM system also uses a rank-based decoding scheme, as described in Bahl et al., “Robust methods for using context-dependent features and models in a continuous speech recognizer,” ICASSP 1994, Vol. 1, pp. 533-536, the disclosure of which is incorporated herein by reference. The rank r(l, t) of a leaf l at time t is the rank order of the likelihood given the mixture model of this leaf in the sorted list of likelihoods computed using all the models of all the leaves in the system and sorting them in descending order. In a rank-based system, the output distributions on the state transitions of the model are expressed in terms of the rank of the leaf. Each transition with arc label a has a probability distribution on ranks which typically has a peak at rank one and rapidly falls off to low probabilities for higher ranks. The probability of rank r(l, t) for arc a is then used as the probability of generating the feature vector at time t on the transition with arc a.
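The rank computation described above can be illustrated with a minimal Python sketch. It is not taken from the cited system: the leaf names, the log-likelihood values, the rank distribution, and the floor value are all hypothetical, and the function names are chosen here only for illustration.

```python
def leaf_ranks(log_likelihoods):
    """Map each leaf to its rank; rank 1 is the highest-scoring leaf."""
    ordered = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
    return {leaf: position + 1 for position, leaf in enumerate(ordered)}


def arc_output_probability(rank_distribution, rank, floor=1e-4):
    """Look up P(rank | arc); ranks past the end of the table get a small floor."""
    if rank <= len(rank_distribution):
        return rank_distribution[rank - 1]
    return floor


# Hypothetical per-leaf log-likelihoods for one frame at time t.
frame_log_likelihoods = {"AA_1": -42.0, "AA_2": -40.5, "IY_3": -55.1}
ranks = leaf_ranks(frame_log_likelihoods)

# Hypothetical rank distribution for one arc, peaked at rank one.
arc_rank_distribution = [0.6, 0.2, 0.1]
prob = arc_output_probability(arc_rank_distribution, ranks["AA_1"])
print(ranks, prob)
```

In this sketch the leaf with the highest likelihood receives rank one, and the probability used for the transition is read from the arc's rank distribution rather than from the likelihood value itself.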
The more often the correct leaf appears in the top rank positions, the better the recognition accuracy. To improve the rank of the correct leaf, its likelihood score must be boosted relative to those of the other leaves. This implies that the likelihood score for the correct leaf will be increased while those of the incorrect leaves will be decreased. A scheme that increases the likelihood of the correct leaf by capturing the correlation between adjacent vectors using correlation models was introduced in P. F. Brown, “The Acoustic-Modeling Problem in Automatic Speech Recognition,” Ph.D. thesis, IBM RC 12750, 1987.
The approach in the P. F. Brown thesis was to do away with the assumption that, given the output distribution at time $t$, the acoustic observation at time $t$ is independent of that at time $t-1$, or depends only on the transition taken at time $t$, that is, $P(y_t \mid s_t)$. Here $y_t$ refers to the cepstral vector corresponding to the speech at time $t$, $s_t$ refers to the transition at time $t$, and $P(y_t \mid s_t)$ refers to the likelihood (i.e., probability) of generating $y_t$ on the transition at time $t$, as understood by those skilled in the art. The manner in which $y_{t-1}$ differs from the mean of the output distribution from which it is generated influences the way that $y_t$ differs from the mean of the output distribution from which it is generated, where $y_{t-1}$ refers to the cepstral vector corresponding to the speech at time $t-1$. This is achieved by conditioning the probability of generating $y_t$ on the transition at time $t$, the transition at time $t-1$ (i.e., $s_{t-1}$), and $y_{t-1}$, that is:
$$P(y_t \mid s_t, s_{t-1}, y_{t-1}) \qquad (1)$$
Incorporating this into an HMM would in effect square the number of output distributions and also increase the number of parameters in each output distribution. When the training data is not sufficient, the benefit of introducing the correlation concept may not be seen. Alternatively, the probability could be conditioned only on the transition taken at time $t$ and the output at time $t-1$, $y_{t-1}$, that is:
$$P(y_t \mid s_t, y_{t-1}) \qquad (2)$$
The output distribution for equation (2) has the form:
$$P(y_t \mid s_t, y_{t-1}) = \frac{(\det W)^{1/2}}{(2\pi)^{n/2}} \exp\left[-\tfrac{1}{2}\, Z' W Z\right] \qquad (3)$$
where $W$ is the inverse of the covariance matrix, $n$ is the dimension of the feature vector, and $Z$ is given by:
$$Z = y_t - \left(\mu_t + C\,(y_{t-1} - \mu_{t-1})\right) \qquad (4)$$
where $\mu_t$ and $\mu_{t-1}$ refer to the means at times $t$ and $t-1$, respectively, as is known in the art, and $C$ is the regression matrix given by:
$$C = \sum (y_t \cdot y_{t-1}) \,/\, |y_t|^2 \qquad (5)$$
This form only increases the number of parameters in each output distribution and not the number of output distributions, making it computationally attractive. However, from a modeling perspective, it is less accurate than equation (1) because the distribution from which $y_{t-1}$ was generated and its deviation from its mean are unknown. There is an important trade-off between the complexity of an acoustic model and the quality of the parameters in that model. The greater the number of parameters in a model, the more variance there will be in the estimates of the probabilities of these acoustic events.
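As a rough illustration of equations (3) and (4), the following Python sketch evaluates the logarithm of the conditional density under a simplifying assumption that is not part of the Brown formulation: both the regression matrix C and the matrix W are taken to be diagonal, so that each dimension can be handled independently. The entries of `w_diag` are treated as inverse variances so that the expression is a normalized gaussian density. All names and numeric values are hypothetical.

```python
import math


def conditional_log_likelihood(y_t, y_prev, mu_t, mu_prev, c_diag, w_diag):
    """Log of equation (3) for diagonal C and W (an assumption of this sketch).

    Each residual z is one component of Z in equation (4):
    z = y_t - (mu_t + c * (y_prev - mu_prev)).
    """
    n = len(y_t)
    log_p = -0.5 * n * math.log(2.0 * math.pi)
    for y, yp, m, mp, c, w in zip(y_t, y_prev, mu_t, mu_prev, c_diag, w_diag):
        z = y - (m + c * (yp - mp))
        # 0.5 * log(w) is this dimension's share of log(det W) / 2.
        log_p += 0.5 * math.log(w) - 0.5 * w * z * z
    return log_p


# Hypothetical two-dimensional example.
y_prev, mu_prev = [1.0, -0.5], [0.0, 0.0]
y_t, mu_t = [0.8, -0.4], [0.0, 0.0]
w_diag = [1.0, 1.0]

with_regression = conditional_log_likelihood(
    y_t, y_prev, mu_t, mu_prev, [0.8, 0.8], w_diag)
without_regression = conditional_log_likelihood(
    y_t, y_prev, mu_t, mu_prev, [0.0, 0.0], w_diag)
print(with_regression, without_regression)
```

In this example the deviation of the previous vector from its mean predicts the deviation of the current vector, so the score computed with the regression term is higher than the score computed with C set to zero, which corresponds to the conventional independence assumption.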
It would be highly desirable to provide techniques for use in speech recognition systems for enhancing the likelihood computation while minimizing or preventing an increase in the complexity of the HMMs.
The present invention provides methods and apparatus for improving recognition performance in a speech recognition system by improving likelihood computation through the use of regression. That is, a methodology is provided that increases the likelihood of the correct leaf by capturing the correlation between adjacent vectors. According to the invention, regression techniques are used to capture such correlation. The regression predicts the neighboring frames of the current frame of speech. The prediction error likelihoods are then incorporated or smoothed into the overall likelihood computation to improve the rank position of the correct leaf, without increasing the complexity of the HMMs.
In an illustrative embodiment of the invention, a method for use with a speech recognition system in processing a plurality of frames of a speech signal includes tagging feature vectors associated with each frame received in a training phase with best aligning gaussian distributions. Then, forward and backward regression coefficients are estimated for the gaussian distributions for each frame. The method further includes computing residual error vectors from the regression coefficients for each frame and then modeling the prediction errors to form a set of gaussian models for the speech associated with each frame. The set of gaussian models is then used to calculate three sets of likelihood values for each frame of a speech signal received during a recognition phase.
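The steps of this embodiment can be sketched in Python under several stated assumptions that go beyond the text above: features are one-dimensional for brevity, the regression is an ordinary least-squares line, each prediction error is modeled by a single gaussian, the three likelihood values are taken to be the frame's own score plus the forward and backward prediction-error scores, and they are combined by a weighted sum of log-likelihoods. The function names, the weights, and the data are hypothetical.

```python
import math
from collections import defaultdict


def fit_regression(pairs):
    """Least-squares slope a and intercept b for target = a * source + b."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x


def fit_gaussian(values, var_floor=1e-3):
    """Mean and floored variance of the residual errors."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    return mean, var


def log_gauss(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def train_error_models(frames, tags):
    """frames: scalar features; tags: best-aligning gaussian id per frame."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    for t in range(1, len(frames) - 1):
        fwd[tags[t]].append((frames[t], frames[t + 1]))  # predict next frame
        bwd[tags[t]].append((frames[t], frames[t - 1]))  # predict previous frame
    models = {}
    for g in fwd:
        entry = {}
        for name, pairs in (("fwd", fwd[g]), ("bwd", bwd[g])):
            a, b = fit_regression(pairs)
            residuals = [y - (a * x + b) for x, y in pairs]
            entry[name] = (a, b) + fit_gaussian(residuals)
        models[g] = entry
    return models


def frame_scores(models, g, base_log_lik, prev_x, x, next_x):
    """Return the base, forward-error and backward-error log-likelihoods."""
    a, b, mean, var = models[g]["fwd"]
    fwd_ll = log_gauss(next_x - (a * x + b), mean, var)
    a, b, mean, var = models[g]["bwd"]
    bwd_ll = log_gauss(prev_x - (a * x + b), mean, var)
    return base_log_lik, fwd_ll, bwd_ll


def combine(scores, weights=(1.0, 0.5, 0.5)):
    """Hypothetical smoothing: weighted sum of the three log-likelihoods."""
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical training data: a slowly rising feature, all tagged with gaussian 0.
frames = [0.0, 1.0, 2.1, 2.9, 4.0, 5.1, 6.0]
tags = [0] * len(frames)
models = train_error_models(frames, tags)

consistent = frame_scores(models, 0, -1.0, 2.9, 4.0, 5.0)
surprising = frame_scores(models, 0, -1.0, 2.9, 4.0, 9.0)
print(combine(consistent) > combine(surprising))
```

With identical base scores, the candidate whose neighboring frames are well predicted by the regression receives the higher combined score, which is the effect relied upon to improve the rank position of the correct leaf.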
Advantageously, in order to achieve low error rates in a speech recognition system, for example, in a system employing rank-based decoding, we discriminate the most confusable incorrect leaves from the correct leaf by lowering their ranks. That is, we increase the likelihood of the correct leaf of a frame, while decreasing the likelihoods of the confusable leaves. In order to do this, we use the auxiliary information from the prediction of the neighboring frames to augment the likelihood computation of the current frame. We then use the residual errors in the predictions of neighboring frames to discriminate between the correct (best) and incorrect leaves of a given frame. We present a new methodology that incorporates prediction error likelihoods into the overall likelihood computation to improve the rank position of the correct leaf.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.