The present invention relates generally to speech recognition systems. More particularly, the invention relates to speech model adaptation in a supervised system employing a corrective adaptation procedure that weights correct and incorrect models by a log likelihood ratio between current and best hypotheses.
Speech recognizers in popular use today employ speech models that contain data derived from training speakers. In many cases, training speech from these speakers is collected in advance and used to generate speaker independent models representing a cross section of the training speaker population. Later, when the speech recognizer is used, data extracted from speech of a new speaker is compared with the speaker independent models and the recognizer identifies the words in its lexicon that represent the best match between the new speech and the existing speech models.
If the new speaker's speech patterns are sufficiently similar to those of the training population, then the recognizer will do a reasonably good job of recognizing the new speaker's speech. However, if the new speaker has a strong regional accent or other speech idiosyncrasies that are not reflected in the training population, then recognition accuracy falls off significantly.
To enhance the reliability of the speech recognizer, many recognition systems implement an adaptation process whereby adaptation speech is provided by the new speaker, and that adaptation speech is used to adjust the speech model parameters so that they more closely represent the speech of the new speaker. Some systems require a significant quantity of adaptation speech. New speakers are instructed to read long passages of text, so that the adaptation system can extract the necessary adaptation data to adapt the speech models.
Where the content of the adaptation speech is known in advance, the adaptation system is referred to as performing "supervised" adaptation. Where the content of the adaptation speech is not known in advance, the adaptation process is referred to as "unsupervised" adaptation. In general, supervised adaptation will provide better results than unsupervised adaptation. Supervised techniques rely on known transcriptions of the adaptation data, whereas unsupervised techniques determine the transcriptions of the adaptation data automatically, using the best models available, and consequently often provide only limited improvements as compared to supervised techniques.
Among the techniques available to perform adaptation, transformation-based adaptation (e.g., Maximum Likelihood Linear Regression or MLLR) and Bayesian adaptation (e.g., Maximum A Posteriori or MAP) are the most popular. While transformation-based adaptation provides a solution for dealing with unseen models, Bayesian adaptation uses a priori information from speaker independent models. Bayesian techniques are particularly useful in dealing with problems posed by sparse data. In practical applications, depending on the amount of adaptation data available, transformation-based techniques, Bayesian techniques, or a combination of both may be chosen.
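By way of illustration only, the manner in which Bayesian adaptation draws on a priori information can be sketched with the commonly used MAP update of a single Gaussian mean. The function name map_adapt_mean, the prior weight tau and the plain-list data layout are assumptions made for this sketch and are not taken from the present specification.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimate of one Gaussian mean (illustrative sketch).

    prior_mean : speaker independent mean, one float per dimension
    frames     : adaptation feature vectors from the new speaker
    posteriors : occupation probability of this Gaussian for each frame
    tau        : assumed prior weight; larger values trust the prior more
    """
    dim = len(prior_mean)
    occupancy = sum(posteriors)
    # Posterior-weighted sum of the adaptation frames, per dimension.
    weighted = [
        sum(g * x[d] for g, x in zip(posteriors, frames)) for d in range(dim)
    ]
    # Interpolate between the prior mean and the adaptation data.
    return [
        (tau * prior_mean[d] + weighted[d]) / (tau + occupancy)
        for d in range(dim)
    ]
```

With little or no adaptation data the occupancy is near zero and the result stays close to the speaker independent mean, which is why such an update behaves well on sparse data.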
Given a small amount of adaptation data, one of the common challenges of supervised adaptation is to provide adapted models that accurately match a user's speaking characteristics and are discriminative. On the other hand, unsupervised adaptation has to deal with inaccuracy of the transcriptions and the selection of reliable information to perform adaptation. For both sets of techniques it is important to adjust the adaptation procedure to the amount of adaptation data available.
The present invention addresses the foregoing issue by providing a corrective adaptation procedure that employs discriminative training. The technique pushes incorrect models away from the correct model, rendering the recognition system more discriminative for the new speaker's speaking characteristics. The corrective adaptation procedure will work with essentially any adaptation technique, including transformation-based adaptation techniques, Bayesian adaptation techniques, and others.
The corrective adaptation procedure of the invention weights correct and incorrect speech models by a log likelihood ratio between the current model and the best hypothesis model. The system generates a set of N-best models and then analyzes these models to generate the log likelihood ratios. Because supervised adaptation is performed, and the correct label sequence is known, the N-best information is exploited by the system in a discriminative way. In the preferred system a positive weight is applied to the correct label and a negative weight is applied to all other labels.
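By way of illustration only, the weighting just described can be sketched as follows. The exact functional form is not given in this passage, so the sketch assumes that each incorrect hypothesis receives a negative weight that decays exponentially with its log likelihood ratio to the best hypothesis. The function name corrective_weights and the constants kappa, rho and eta are hypothetical and are chosen purely for the example.

```python
import math


def corrective_weights(nbest, correct_label, kappa=2.0, rho=0.3, eta=0.01):
    """Assign a weight to each N-best hypothesis (illustrative sketch).

    nbest         : list of (label, log_likelihood) pairs
    correct_label : label known from the supervised transcription
    kappa         : assumed positive weight for the correct label
    rho, eta      : assumed scale and decay for the incorrect labels
    """
    best_ll = max(ll for _, ll in nbest)
    weights = {}
    for label, ll in nbest:
        if label == correct_label:
            # The correct label is reinforced with a positive weight.
            weights[label] = kappa
        else:
            # Log likelihood ratio to the best hypothesis (always <= 0).
            llr = ll - best_ll
            # Close competitors are pushed away harder than distant ones.
            weights[label] = -rho * math.exp(eta * llr)
    return weights
```

For example, calling corrective_weights([("yes", -100.0), ("yet", -98.0), ("guess", -120.0)], "yes") gives "yes" a positive weight and gives both competitors negative weights, with "yet" (the closer competitor) penalized more strongly than "guess".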
In comparison with other discriminative methods, the corrective adaptation technique of the invention has several advantages. It is computationally inexpensive, and it is easy to implement. Moreover, the technique carries out discrimination that is specific to a given speaker, such that convergence is not an issue.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.